indexing xml messages

by vsevel

Hi, the following JUnit test fails on 3 of the 5 searches (the commented-out assertions):

    @Test
    public void indexXML() throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
        Document doc = new Document();
        String xml = FileHelper.readFileContent("lucene_work/myxml.xml");
        doc.add(new Field("myxml", xml, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
        writer.close();
       
        IndexReader reader = IndexReader.open(dir, true); // only searching, so read-only=true
        Searcher searcher = new IndexSearcher(reader);
        // Assert.assertEquals(1, searcher.search(new TermQuery(new Term("myxml", "123AB")), 1).totalHits);
        Assert.assertEquals(1, searcher.search(new TermQuery(new Term("myxml", "reference")), 1).totalHits);
        // Assert.assertEquals(1, searcher.search(new TermQuery(new Term("myxml", "operationImpact")), 1).totalHits);
        Assert.assertEquals(1, searcher.search(new TermQuery(new Term("myxml", "data")), 1).totalHits);
        // Assert.assertEquals(1, searcher.search(new TermQuery(new Term("myxml", "EFG")), 1).totalHits);
        searcher.close();
        reader.close();
    }

given this xml message:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<operationImpact>
        <reference value="123AB"/>
        <data>EFG</data>
</operationImpact>

How do I get this to work? My goal is to be able to do full text search on XML documents. This includes tags, attribute values and tag values.

Thanks,
vince

Re: indexing xml messages

by Ian Lea

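StandardAnalyzer will, amongst other things, convert everything to
lowercase at index time. A TermQuery is not analyzed, so term queries
on mixed or upper case text ("123AB", "operationImpact", "EFG") will
fail to match. Lowercasing those terms before building the query
("123ab", "operationimpact", "efg") should make them match.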
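
To make the case mismatch concrete, here is a minimal sketch in plain
Java with no Lucene dependency. The class LowercaseTermDemo and its
methods are hypothetical names invented for this illustration, and the
tokenizer is a deliberately crude stand-in (split on non-alphanumerics,
then lowercase), not StandardAnalyzer's real grammar. It only shows why
an exact-case lookup misses a term that was lowercased at index time.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer. This is NOT Lucene's StandardAnalyzer;
// it mimics only the lowercasing behaviour discussed above.
class LowercaseTermDemo {

    // Split on anything that is not a letter or digit, then lowercase each token.
    static Set<String> analyze(String text) {
        Set<String> terms = new LinkedHashSet<String>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Exact lookup, in the spirit of a TermQuery: the term is used as given.
    static boolean termMatches(Set<String> indexedTerms, String term) {
        return indexedTerms.contains(term);
    }

    public static void main(String[] args) {
        String xml = "<operationImpact><reference value=\"123AB\"/>"
                + "<data>EFG</data></operationImpact>";
        Set<String> terms = analyze(xml);
        System.out.println(termMatches(terms, "123AB")); // false
        System.out.println(termMatches(terms, "123ab")); // true
    }
}
```

With real Lucene the same idea would presumably mean either lowercasing
the text passed to Term, or building the query through the same
analyzer that was used for indexing.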

There is some info on indexing XML docs in the FAQ
http://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_index_XML_documents.3F
and I'm sure that Google would find loads more stuff.
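
One possible approach (a sketch only, and not necessarily what the FAQ
describes) is to parse the XML first and index the extracted pieces
rather than the raw markup. The example below uses the JDK's SAX parser
to collect tag names, attribute values and tag values. XmlTextExtractor
and extract are hypothetical names for this illustration; how the
pieces are then mapped onto Lucene fields is left open.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: gathers tag names, attribute values and
// non-blank text from an XML string, in document order.
class XmlTextExtractor {

    static List<String> extract(String xml) {
        final List<String> parts = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                parts.add(qName);                    // tag name
                for (int i = 0; i < attrs.getLength(); i++) {
                    parts.add(attrs.getValue(i));    // attribute value
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver long text in several chunks;
                // each non-blank chunk is recorded separately here.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    parts.add(text);                 // tag value
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return parts;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n"
                + "<operationImpact>\n"
                + "        <reference value=\"123AB\"/>\n"
                + "        <data>EFG</data>\n"
                + "</operationImpact>";
        System.out.println(extract(xml));
    }
}
```

The extracted strings could be joined and added to an analyzed field
such as "myxml", or split across separate fields. The lowercasing
caveat for term queries applies either way.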

And Luke is invaluable for seeing what your index really holds.


--
Ian.


On Tue, Nov 3, 2009 at 7:40 AM, vsevel <v.sevel@...> wrote:

> How do I get this to work? My goal is to be able to do full text search on
> XML documents. This includes tags, attribute values and tag values.