Lucene oddity


by Jason Smith-11

eXist version: 1.3.0dev-rev9849

This is the first time I have worked with Lucene, so I don't know if (or how) this worked in previous versions of eXist.

I have the following configuration:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:atom="http://www.w3.org/2005/Atom"
           xmlns:html="http://www.w3.org/1999/xhtml"
           xmlns:wiki="http://exist-db.org/xquery/wiki">
        <!-- Disable the standard full text index -->
        <fulltext default="none" attributes="no"/>
        <!-- Lucene index is configured below -->
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
            <text match="*|@*"/>
        </lucene>
    </index>
</collection>

I run this XQuery:

import module namespace lucene = 'http://exist-db.org/xquery/lucene';

//SPEECH[lucene:query(., 'lord')]

And as you would expect, I get a bunch of <SPEECH/> elements back. Great! So far, so good.

However, if I change the text element to this (and reindex):

<text match="//SPEECH//*"/>

I get nothing. Nada. Zip. Yes, I am reindexing consistently. I also had Paul Ryan look at it, and he's stumped (he's better with the deep eXist configuration than I am).

I'm pulling in both "Much Ado About Nothing" and "The Comedy of Errors". The text match element is straight from your documentation for Lucene, so I would expect it to work.

Any ideas on different things I could try to figure this out? Am I using it wrong?

Jason Smith


------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now.  http://p.sf.net/sfu/bobj-july
_______________________________________________
Exist-open mailing list
Exist-open@...
https://lists.sourceforge.net/lists/listinfo/exist-open

Re: Lucene oddity

by Michael Sokolov-2

I think it's recommended to use qname indexes now.  I don't *think* the path-based indexing was disabled: can't give you a definite answer there, but what works for me is:

<collection xmlns="http://exist-db.org/collection-config/1.0">
     <index xmlns:ifp="http://www.ifactory.com/press"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <fulltext default="none" attributes="no"/>
         <lucene>
             <text qname="SPEECH" />
         </lucene>
     </index>
</collection>


Re: Lucene oddity

by Wolfgang Meier-2

 > <text match="//SPEECH//*"/>

The pattern syntax for match looks like XPath, but it is not. In
particular, // is a bit counter-intuitive: match="//SPEECH//*" includes
all descendant nodes of SPEECH, but not SPEECH itself. I will think
about changing this, but I'm not yet sure how (maybe we should choose
other separators than /, so it doesn't look like XPath).

Anyway, with your configuration,

//SPEECH[lucene:query(LINE, 'lord')]

should return matches while

//SPEECH[lucene:query(., 'lord')]

does not since SPEECH itself has no index.

If you create an index on SPEECH, you can query SPEECH, but not its
child LINE, and vice versa.

Wolfgang
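
A minimal sketch of one way to make both query forms work, assuming that qname and match definitions can sit side by side in the same <lucene> block (not verified against rev9849): index SPEECH itself by qname, and keep the match pattern for its descendants.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- Disable the standard full text index -->
        <fulltext default="none" attributes="no"/>
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <!-- Index SPEECH itself, so a query with "." as context on SPEECH has an index to use -->
            <text qname="SPEECH"/>
            <!-- Index every descendant of SPEECH (LINE, SPEAKER, ...), but not SPEECH itself -->
            <text match="//SPEECH//*"/>
        </lucene>
    </index>
</collection>
```

After a reindex, //SPEECH[lucene:query(., 'lord')] and //SPEECH[lucene:query(LINE, 'lord')] should then each find an index on their context node.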


Re: Lucene oddity

by Wolfgang Meier-2

> I think it's recommended to use qname indexes now.  

We recommend using a qname instead of a "path" for the range and old
full text indexes. The new indexes (Lucene, N-gram) don't accept the old
type of "path" definition. Background: for performance reasons, it is
best if eXist knows at compile time which indexes are available. This
wasn't possible with the old indexes on "path".

However, since people started complaining, I reintroduced a somewhat
similar - though different behind the scenes - feature for the Lucene
index, which now accepts a "match" attribute with a path. I'm not sure
if we should keep that. Maybe it would be better to make the semantics
more explicit, e.g.

<text qname="SPEECH" descend="yes"/>

which would index SPEECH and all descendants below it.

Wolfgang


Re: Lucene oddity

by Michael Sokolov-2

I'm not sure if the path syntax allows you to do more than you can with

<text qname="SPEECH" descend="yes|no"/>

but if it doesn't, this seems better.  I think this syntax is probably
easier to understand; the other one seems as if it could easily mislead
one into thinking xpath is available, as you said.

Also: it's been a little while since I worked on this, but I thought I
remembered the default being descend="yes", effectively, with the option
of specifying <ignore> elements to exclude some content - is that right?

-Mike


Re: Lucene oddity

by Wolfgang Meier-2

> Also: it's been a little while since I worked on this, but I thought I
> remembered the default being descend="yes", effectively, with the option
> of specifying <ignore> elements to exclude some content - is that right?

No, <text qname="SPEECH"/> creates an index ONLY on SPEECH. What is
passed to Lucene is the string value of SPEECH, which includes the text
of all its descendant text nodes, *except* those filtered out by an
optional <ignore>. For example, consider the fragment:

<SPEECH>
   <SPEAKER>Second Witch</SPEAKER>
   <LINE>Fillet of a fenny snake,</LINE>
   <LINE>In the cauldron boil and bake;</LINE>
</SPEECH>

If you have an index on SPEECH, Lucene will create a "document" with the
text "Second Witch Fillet of a fenny snake, In the cauldron boil and
bake;" and index it. eXist internally links this Lucene document to
the SPEECH node, but Lucene has no knowledge of that (it doesn't know
anything about XML nodes).

The query:

//SPEECH[ft:query(., 'cauldron')]

searches the index and finds the "document" containing the SPEECH text,
which eXist can trace back to the SPEECH node in the XML document.
However, it is required that you use the same context (SPEECH) for
creating and querying the index. The query:

//SPEECH[ft:query(LINE, 'cauldron')]

will not return anything, even though LINE is a child of SPEECH and
'cauldron' was indexed. This particular 'cauldron' is linked to its
ancestor SPEECH node, not its parent LINE.

However, you are free to give the user both options, i.e. use SPEECH and
LINE as context at the same time. How? Simply define a second index on LINE:

<text qname="SPEECH"/>
<text qname="LINE"/>

Concerning <ignore> and <inline>: every text string is passed through
Lucene's analyzer before it is indexed. eXist's <ignore> and <inline>
configuration tags simply allow you to slightly modify the text before
Lucene sees it. The config

<text qname="SPEECH"><ignore qname="SPEAKER"/></text>

removes the SPEAKER part, so Lucene will only see "Fillet of a fenny
snake, In the cauldron boil and bake;".

I hope this helps to clarify the issue. I think I will add this
explanation to the documentation.

Wolfgang
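
Putting the fragments from this message together, a complete collection configuration might look roughly like the sketch below. It assumes that <ignore> may be combined with a second, separate index on LINE in the same <lucene> block, and it leaves the analyzer at its default.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- Disable the standard full text index -->
        <fulltext default="none" attributes="no"/>
        <lucene>
            <!-- One Lucene "document" per SPEECH, built from its string value minus SPEAKER -->
            <text qname="SPEECH">
                <ignore qname="SPEAKER"/>
            </text>
            <!-- A second, separate index so that LINE can also serve as query context -->
            <text qname="LINE"/>
        </lucene>
    </index>
</collection>
```

After a reindex, //SPEECH[ft:query(., 'cauldron')] and //SPEECH[ft:query(LINE, 'cauldron')] should both match the fragment above, while a search for 'witch' with SPEECH as context should no longer match it, because the SPEAKER text is filtered out.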


Re: Lucene oddity

by Jason Smith-11

You are correct that when we roll out to production, we'll want to optimize our indexes a bit.  It's very likely we'll only need to index on a few names.  But in development, we don't know what those are going to be yet!  :-)  So as long as the performance isn't abysmal on small data sets, I am willing to give up some performance for guaranteed *correct* results.  At least for now... in development...

And we'll need to do a good job of documenting the *correct* configuration settings for production installations. :-)

By the way, the scoring feature works great!  I love it!

And thanks again for all the detailed technical information on this.  It seems really quite usable.

Jason Smith
________________________________________
From: Wolfgang [wolfgang@...]
Sent: Friday, September 04, 2009 3:22 PM
To: Jason Smith
Subject: Re: [Exist-open] Lucene oddity

> I have one more question - what is the best way to enable full-text
> indexing (Lucene based, of course) for all text in a collection? Is
> there a reason I might not want to do this?

You can enable full text indexing for an entire document text by putting
an index on the top-level element of that document. However, this is not
the same as doing: <text match="//*"/> since that will create an
individual index on *every node* within the document and thus generate a
lot of redundancy.

As I tried to explain, you have to figure out which parts of your
document will likely be interesting as context for a full text query.
The full text index will work best if the context isn't too narrow. For
example, if you have a document structure with section divs, headings
and paragraphs, I would probably want to create an index on the divs and
maybe on the headings, so the user can differentiate between the two. In
some cases, I could decide to put the index on the paragraph level, but
then I don't need the index on the section since I can always get from
the paragraph back to the section.

Wolfgang
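
As a rough illustration of the advice above, a configuration for a document made of section divs with headings could be sketched as follows. The element names div and head are hypothetical; substitute the names your documents actually use.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- Disable the standard full text index -->
        <fulltext default="none" attributes="no"/>
        <lucene>
            <!-- Hypothetical element names: adjust to the actual document vocabulary -->
            <!-- Broad context: the whole text of each section -->
            <text qname="div"/>
            <!-- Narrow context: headings only, so users can search titles separately -->
            <text qname="head"/>
        </lucene>
    </index>
</collection>
```

This keeps the index to two contexts instead of one per node, which is the redundancy that <text match="//*"/> would create.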

Re: Lucene oddity

by VanP

So I'm trying to do a Lucene full-text index on an attribute value.  The following works fine:

<text match="//@id"/>

But the documentation says that the "match" syntax is experimental and may be removed.  I tried the following using the qname syntax, but neither worked:

<text qname="@id"/> or <text qname="id"/>

How do you index an attribute value using the qname syntax?

Re: Lucene oddity

by Roy Walter-2

<text qname="@id"/>

Did you reindex?

-- Roy


Re: Lucene oddity

by VanP


Roy Walter-2 wrote:

> <text qname="@id"/>
>
> Did you reindex?

Well, I thought I did, but I just tried it and now it works.  I wasn't sure I had the right syntax as the documentation doesn't show adding an attribute as an index.  Thanks for getting me to try it again.

Paul
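
For reference, the attribute index that ended up working in this exchange would sit in a collection configuration roughly like this. It is a minimal sketch, and the surrounding elements are assumed to match the configurations shown earlier in the thread.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- Disable the standard full text index -->
        <fulltext default="none" attributes="no"/>
        <lucene>
            <!-- The leading @ marks the qname as an attribute name -->
            <text qname="@id"/>
        </lucene>
    </index>
</collection>
```

As the exchange shows, the collection has to be reindexed before a changed configuration takes effect.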