[jira] Created: (LUCENE-1522) another highlighter


[jira] Commented: (LUCENE-1522) another highlighter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682492#action_12682492 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
I'd also like it to work if you don't have termvectors stored (though
be faster if they are perhaps, as it is now).
{quote}

I agree.

{quote}
Getting hit positions for position sensitive clauses requires
converting the query to a span query and calling getSpans on a memory
index
{quote}

Is the reason why H1 creates the full token stream (even when
TermVectors is the source) in order to build the MemoryIndex?

If term vectors (w/ positions, offsets) were stored, wouldn't it be
possible to make a simple index (or at least TermDocs, TermPositions)
wrapped on those TermVectors?


> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream (general token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks query boost to score fragments (currently doesn't see idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...


[jira] Commented: (LUCENE-1522) another highlighter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682494#action_12682494 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------


{quote}
But, if you have for example a document 'a b c a b c' and the query
'a AND b', then this approach would only highlight the first two terms,
no?
{quote}

Ahh right -- in fact, nothing would be highlighted because the scorer
for AND queries doesn't visit positions at all (it doesn't need to).

I guess we'd have to ask such scorers to forcefully visit positions &
enumerate all matches within one doc, when running in "highlight"
mode.  Hmm, feeling like a big change...

But maybe it could work.  It'd be sort of like a positional-aware
"explain", ie "show me the term occurrences that allowed the full
query to accept this document".

Imagine query "(a AND b) OR (c AND d)".  When looking at the fragments
for each doc, I would want to see both a AND b, or both c AND d, but
never (for example) just a and d.

But, flattening could produce just a and d (I think?); and I think H1
could do the same even with SpanScorer (Mark is that true?  I don't
fully understand the Query -> SpanQuery conversion).

Whereas if we could ask for positions of the "real" matches I think it
would work correctly?
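
The concern about flattening can be made concrete with a small sketch. It assumes a highlighter that only knows the flattened bag of query terms and marks each one wherever it appears; the class and method names are hypothetical and are not Lucene API. For a fragment taken from a document that matched through the "a AND b" branch elsewhere in the text, such a highlighter may well show only a and d:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these types exist in Lucene.
class FlattenedHighlightDemo {

    // Flattened view of "(a AND b) OR (c AND d)": just the bag of query terms.
    static final Set<String> FLATTENED_TERMS = Set.of("a", "b", "c", "d");

    // True if the token set satisfies the real boolean structure of the query.
    static boolean matchesQuery(Set<String> tokens) {
        boolean left = tokens.contains("a") && tokens.contains("b");
        boolean right = tokens.contains("c") && tokens.contains("d");
        return left || right;
    }

    // Term-at-a-time highlighting: collect every token found in the flattened set.
    static List<String> highlightedTerms(String fragment) {
        List<String> hits = new ArrayList<>();
        for (String token : fragment.split("\\s+")) {
            if (FLATTENED_TERMS.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A fragment of a longer document; "b" occurs elsewhere in that document.
        String fragment = "x a y d z";
        List<String> hits = highlightedTerms(fragment);
        System.out.println("highlighted: " + hits);
        System.out.println("fragment matches query: " + matchesQuery(new HashSet<>(hits)));
    }
}
```

The fragment highlights a and d, yet those two terms alone do not satisfy the query, which is the mismatch described above.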




[jira] Commented: (LUCENE-1522) another highlighter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682502#action_12682502 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

bq. Not sure it solves being able to gets offsets from the query terms and still mask for positions though

Can you explain that more?




[jira] Commented: (LUCENE-1522) another highlighter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682506#action_12682506 ]

Mark Miller commented on LUCENE-1522:
-------------------------------------

{quote}Is the reason why H1 creates the full token stream (even when
TermVectors is the source) in order to build the MemoryIndex?

If term vectors (w/ positions, offsets) were stored, wouldn't it be
possible to make a simple index (or at least TermDocs, TermPositions)
wrapped on those TermVectors? {quote}

It creates the full tokenstream because it was designed to work without termvectors, and so without offset info for the query terms. It rebuilds the stream and processes a token at a time; the API gives you hooks to highlight at any of these tokens. That's essentially the bottleneck, I think: taking everything a token at a time, but the whole API is based on that fact. With the SpanScorer version, we can get almost any info from the MemoryIndex, but it was convenient to fit into the current highlighter API to start. I had it in mind to break from the API and make a large-doc highlighter that didn't need termvectors, but I found the memory index and getSpans to still be too slow in my initial testing. I'd hoped to work more on it, but haven't had a chance. So essentially, while more can be done with termvectors, the improvements break the current API at a pretty deep level. No one has done the work to solve that, I guess, which is why we have the alternate highlighters.




[jira] Commented: (LUCENE-1522) another highlighter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682515#action_12682515 ]

Marvin Humphrey commented on LUCENE-1522:
-----------------------------------------

> It'd be sort of like a positional-aware "explain", ie "show me the term
> occurrences that allowed the full query to accept this document".

FWIW, this is more or less how the KinoSearch highlighter now works in svn
trunk.  It doesn't use a Scorer, though, but instead the KS analogue to
Lucene's "Weight" class.

The (Weight) is fed what is essentially a single doc index, using stored term
vectors.  Weight.highlightSpans() returns an array of "span" objects, each of
which has a start offset, a length, and a score.  The Highlighter then
processes these span objects to create a "heat map" and choose its excerpt
points.

The idea is that by delegating responsibility for creating the scoring spans, we
make it easier to support arbitrary Query implementations with a single
Highlighter class.
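
As a rough illustration of the span and heat-map idea described above, here is a minimal sketch. The `Span` class, `heatMap`, and `hottestOffset` are hypothetical names invented for this example; they are not the KinoSearch or Lucene API, and the real heat map does more weighting than a plain sum.

```java
import java.util.List;

// Hypothetical types sketched from the description above; not KinoSearch or Lucene API.
class SpanHeatMapDemo {

    // One scoring span: start offset, length, and score.
    static final class Span {
        final int start;
        final int length;
        final float score;

        Span(int start, int length, float score) {
            this.start = start;
            this.length = length;
            this.score = score;
        }
    }

    // Sum the scores of all spans covering each character offset.
    static float[] heatMap(int textLength, List<Span> spans) {
        float[] heat = new float[textLength];
        for (Span s : spans) {
            int end = Math.min(textLength, s.start + s.length);
            for (int i = Math.max(0, s.start); i < end; i++) {
                heat[i] += s.score;
            }
        }
        return heat;
    }

    // Offset of the hottest character; ties go to the earliest offset.
    static int hottestOffset(float[] heat) {
        int best = 0;
        for (int i = 1; i < heat.length; i++) {
            if (heat[i] > heat[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog";
        List<Span> spans = List.of(
            new Span(4, 5, 1.0f),    // "quick"
            new Span(16, 3, 2.5f),   // "fox"
            new Span(40, 3, 1.0f));  // "dog"
        float[] heat = heatMap(text.length(), spans);
        int at = hottestOffset(heat);
        // prints "hottest offset 16 -> fox"
        System.out.println("hottest offset " + at + " -> " + text.substring(at, at + 3));
    }
}
```

The point of the design is that the highlighter only consumes spans, so any query type that can report spans could plug in without the highlighter knowing about it.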



[jira] Issue Comment Edited: (LUCENE-1522) another highlighter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682506#action_12682506 ]

Mark Miller edited comment on LUCENE-1522 at 3/16/09 4:43 PM:
--------------------------------------------------------------

{quote}Is the reason why H1 creates the full token stream (even when
TermVectors is the source) in order to build the MemoryIndex?

If term vectors (w/ positions, offsets) were stored, wouldn't it be
possible to make a simple index (or at least TermDocs, TermPositions)
wrapped on those TermVectors? {quote}

It creates the full tokenstream because it was designed to work without termvectors, and so without offset info for the query terms. It rebuilds the stream and processes a token at a time; the API gives you hooks to highlight at any of these tokens. That's essentially the bottleneck, I think: taking everything a token at a time, but the whole API is based on that fact. With the SpanScorer version, we can get almost any info from the MemoryIndex, but it was convenient to fit into the current highlighter API to start. I had it in mind to break from the API and make a large-doc highlighter that didn't need termvectors, but I found the memory index and getSpans to still be too slow in my initial testing. I'd hoped to work more on it, but haven't had a chance. So essentially, while more can be done with termvectors, the improvements break the current API at a pretty deep level. No one has done the work to solve that, I guess, which is why we have the alternate highlighters.

*edit*

I suppose one of the main problems with the large-doc approach I briefly tested is that it still requires rebuilding the tokenstream (and I was attempting to not use termvectors either).  Avoiding the need for that would probably make it much more competitive.



[jira] Commented: (LUCENE-1522) another highlighter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682609#action_12682609 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
> It'd be sort of like a positional-aware "explain", ie "show me the term
> occurrences that allowed the full query to accept this document".

FWIW, this is more or less how the KinoSearch highlighter now works in svn
trunk. It doesn't use a Scorer, though, but instead the KS analogue to
Lucene's "Weight" class.

The (Weight) is fed what is essentially a single doc index, using stored term
vectors. Weight.highlightSpans() returns an array of "span" objects, each of
which has a start offset, a length, and a score. The Highlighter then
processes these span objects to create a "heat map" and choose its excerpt
points.

The idea is that by delegating responsibility for creating the scoring spans, we
make it easier to support arbitrary Query implementations with a single
Highlighter class.
{quote}

Awesome!

Do you require term vectors to be stored, for highlighting (cannot
re-analyze the text)?

For queries that normally do not use positions at all (simple AND/OR
of terms), how does your highlightSpans() work?

For BooleanQuery, is coord factor used to favor fragment sets that
include more unique terms?

Are you guaranteed to always present a net set of fragments that
"matches" the query? (eg the example query above).

I think the base litmus test for a highlighter is: if one were to
take all fragments presented for a document (call this a "fragdoc")
and make a new document from it, would that document match the
original query?

In fact, I think the perfect highlighter would "logically" work as
follows: take a single document and enumerate every single possible
fragdoc.  Each fragdoc is allowed to have maxNumFragments fragments,
where each fragment has a min/max number of characters.  The set of
fragdocs is of course ridiculously immense.

Take this massive collection of fragdocs and build a new temporary
index, then run your Query against that index.  Many of the fragdocs
would not match the Query, so they are eliminated right off (this is
the litmus test).  Then, of the ones that do, you want the highest
scoring fragdocs.

Obviously you can't actually implement a highlighter like that, but I
think "logically" that is the optimal highlighter that we are trying
to emulate with more efficient implementations.
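
A toy version of that "logical" highlighter might look like the sketch below. It assumes a tiny, pre-chosen list of candidate fragments (rather than every possible fragment), models the query as a plain predicate over the token set, and leaves out the final scoring step; all names are hypothetical and nothing here is Lucene API. It is only meant for very small inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical brute-force sketch; nothing here is Lucene API.
class FragdocLitmusDemo {

    // Tokens of a fragdoc: all of its fragments, split on whitespace.
    static Set<String> tokens(List<String> fragdoc) {
        Set<String> out = new HashSet<>();
        for (String fragment : fragdoc) {
            out.addAll(Arrays.asList(fragment.trim().split("\\s+")));
        }
        return out;
    }

    // Enumerate every non-empty subset of candidates with at most maxNumFragments fragments.
    static List<List<String>> enumerateFragdocs(List<String> candidates, int maxNumFragments) {
        List<List<String>> result = new ArrayList<>();
        for (int mask = 1; mask < (1 << candidates.size()); mask++) {
            if (Integer.bitCount(mask) > maxNumFragments) {
                continue;
            }
            List<String> fragdoc = new ArrayList<>();
            for (int i = 0; i < candidates.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    fragdoc.add(candidates.get(i));
                }
            }
            result.add(fragdoc);
        }
        return result;
    }

    // Keep only the fragdocs that pass the litmus test (the query accepts them).
    static List<List<String>> passing(List<List<String>> fragdocs, Predicate<Set<String>> query) {
        List<List<String>> ok = new ArrayList<>();
        for (List<String> fragdoc : fragdocs) {
            if (query.test(tokens(fragdoc))) {
                ok.add(fragdoc);
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        // Stand-in for the query "(a AND b) OR (c AND d)".
        Predicate<Set<String>> query = t ->
            (t.contains("a") && t.contains("b")) || (t.contains("c") && t.contains("d"));
        List<String> candidates = List.of("x a y", "b z", "d w");
        for (List<String> fragdoc : passing(enumerateFragdocs(candidates, 2), query)) {
            System.out.println(fragdoc); // prints "[x a y, b z]"
        }
    }
}
```

Of the six fragdocs with at most two fragments, only the one containing both a and b survives the litmus test; a real implementation would then rank the survivors by score.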

I think having the Query/Weight/Scorer class be the single-source for
hits, explanation & highlight spans is the right approach.  Having a
whole separate package trying to reverse-engineer where matches had
taken place between Query and Document is hard to get right.  EG
BooleanScorer2's coord factor would naturally/correctly influence the
selection.

I also think building a [reduced, just Postings] IndexReader API on top of
TermVectors ought to be a simple way to get great performance here.
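
To sketch what a reduced, postings-only view over term vectors could offer, the example below keeps a per-term list of positions and offsets for a single document and answers an exact two-term phrase lookup from it. The class and its methods are hypothetical, and the whitespace tokenization in the constructor merely stands in for term vectors that would already be stored with positions and offsets; this is not the Lucene IndexReader API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-document postings view; not the Lucene IndexReader API.
class TermVectorPostingsDemo {

    // term -> occurrences, each stored as {position, startOffset, endOffset}
    private final Map<String, List<int[]>> postings = new HashMap<>();

    // Whitespace tokenization stands in for reading stored term vectors.
    TermVectorPostingsDemo(String text) {
        int pos = 0;
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') {
                i++;
            }
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') {
                i++;
            }
            if (i > start) {
                postings.computeIfAbsent(text.substring(start, i), k -> new ArrayList<>())
                        .add(new int[] {pos++, start, i});
            }
        }
    }

    // All occurrences of a term, or an empty list if the term is absent.
    List<int[]> termPositions(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    // Offsets {start, end} of every exact occurrence of the phrase "first second".
    List<int[]> phraseOffsets(String first, String second) {
        List<int[]> hits = new ArrayList<>();
        for (int[] a : termPositions(first)) {
            for (int[] b : termPositions(second)) {
                if (b[0] == a[0] + 1) {
                    hits.add(new int[] {a[1], b[2]});
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TermVectorPostingsDemo doc = new TermVectorPostingsDemo("w1 w3 w2 w3 w1 w2");
        for (int[] hit : doc.phraseOffsets("w1", "w2")) {
            System.out.println(hit[0] + "-" + hit[1]); // prints "12-17"
        }
    }
}
```

Because positions and offsets come straight from the stored data, the phrase hit is located without re-analyzing the text.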




[jira] Commented: (LUCENE-1522) another highlighter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682672#action_12682672 ]

Marvin Humphrey commented on LUCENE-1522:
-----------------------------------------

> Do you require term vectors to be stored, for highlighting (cannot
> re-analyze the text)?

Yes, but that's not fundamental to the design.  You just have to hand the
Weight some sort of single-doc index that includes sufficient data to
determine what parts of the text contributed to the hit and how much they
contributed.  The Weight needn't care whether that single-doc index was
created on the fly or stored at index time.

> For queries that normally do not use positions at all (simple AND/OR
> of terms), how does your highlightSpans() work?

ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
spans produced by their children.
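
A minimal sketch of that union behaviour, assuming a hypothetical `Node` interface with a `highlightSpans` method over whitespace-tokenized text (the names echo the description but are not KinoSearch or Lucene API). It assumes the document has already been accepted by the query, and it does not remove duplicate spans:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical query tree; the names mirror the description above, not a real API.
class UnionSpansDemo {

    interface Node {
        // Spans as {startOffset, length} for a whitespace-tokenized text.
        List<int[]> highlightSpans(String text);
    }

    // Leaf node: one span per occurrence of the term.
    static Node term(String term) {
        return text -> {
            List<int[]> spans = new ArrayList<>();
            int i = 0;
            while (i < text.length()) {
                while (i < text.length() && text.charAt(i) == ' ') {
                    i++;
                }
                int start = i;
                while (i < text.length() && text.charAt(i) != ' ') {
                    i++;
                }
                if (i > start && text.substring(start, i).equals(term)) {
                    spans.add(new int[] {start, i - start});
                }
            }
            return spans;
        };
    }

    // AND and OR differ in matching, but both report the union of their children's spans.
    static Node union(Node... children) {
        return text -> {
            List<int[]> spans = new ArrayList<>();
            for (Node child : children) {
                spans.addAll(child.highlightSpans(text));
            }
            spans.sort(Comparator.comparingInt((int[] s) -> s[0]));
            return spans;
        };
    }

    public static void main(String[] args) {
        Node aAndB = union(term("a"), term("b"));
        for (int[] span : aAndB.highlightSpans("a b c a b c")) {
            System.out.println(span[0] + "+" + span[1]);
        }
    }
}
```

For the document 'a b c a b c' and the query 'a AND b' discussed earlier in the thread, this reports all four occurrences of a and b, not just the first two.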

> For BooleanQuery, is coord factor used to favor fragment sets that
> include more unique terms?

No; I don't think that would be fine grained enough to help.

There's a HeatMap class that performs additional weighting.  Spans that
cluster together tightly (i.e. that could fit together within the excerpt) are
boosted.

> Are you guaranteed to always present a net set of fragments that
> "matches" the query? (eg the example query above).

No.  The KS version supplies a single fragment.  It naturally prefers
fragments with rarer terms, because the span scores are multiplied by the
Weight's weighting factor (which includes IDF).  

Once that fragment is selected, the KS highlighter worries a lot about
trimming to sensible sentence boundaries.

In my own subjective judgment, supplying a single maximally coherent fragment
which prefers clusters of rare terms results in an excerpt which "scans" as
quickly as possible, conveying the gist of the content with minimal "visual
effort".  I used Google's excerpting as a model.

> I think the base litmus test for a highlighter is: if one were to
> take all fragments presented for a document (call this a "fragdoc")
> and make a new document from it, would that document match the
> original query?

Without the aid of formal studies to guide us, this is a subjective call.
FWIW, I disagree.  In my view, visual scanning speed and coherence
are more important than completeness.  

I'm not a big fan of the multi-fragment approach, because I think it takes too
much effort to grok each individual entry.  Furthermore, the fact that the
fragments don't start on sentence boundaries (whenever feasible) adds to the
visual effort needed to orient yourself.

Search results contain a lot of junk.  The user needs to be able to parse the
results page as quickly as possible and refine their search query as needed.
Noisy excerpts, with lots of ellipses and few sentences that can be "swallowed
whole" impede that.  Trees vs. Forest.

Again, that's my own aesthetic judgment, but I'll wager that there are studies
out there showing that fragments which start at the top of a sentence are
easier to consume, and I think that's important.

> In fact, I think the perfect highlighter would "logically" work as
> follows: take a single document and enumerate every single possible
> fragdoc.

KS uses a sliding window rather than chunking up the text into fragdocs of
fixed length.
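
(Illustration only: a minimal sliding-window selection, not the KS code. `SlidingWindow`, the parallel arrays, and the "sum of weights of fully contained spans" score are assumptions for the sketch. Candidate windows are anchored at each span start, so a cluster that straddles a fixed chunk boundary is still seen as one unit.)

```java
final class SlidingWindow {
    /**
     * Returns the start offset of the window of the given width that fully contains
     * the largest total span weight. Spans must be sorted by start offset.
     */
    static int bestWindowStart(int[] starts, int[] ends, float[] weights, int width) {
        int bestStart = 0;
        float bestScore = -1.0f;
        for (int i = 0; i < starts.length; i++) {
            int windowStart = starts[i];
            int windowEnd = windowStart + width;
            float score = 0.0f;
            for (int j = i; j < starts.length && starts[j] < windowEnd; j++) {
                if (ends[j] <= windowEnd) {
                    score += weights[j];
                }
            }
            if (score > bestScore) {
                bestScore = score;
                bestStart = windowStart;
            }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        // Two spans sit on either side of offset 100; a third, heavier span is far away.
        int[] starts = {95, 102, 300};
        int[] ends = {100, 107, 305};
        float[] weights = {1.0f, 1.0f, 1.5f};
        System.out.println(bestWindowStart(starts, ends, weights, 100));
    }
}
```

With fixed 100-character chunks the first two spans in `main` would fall into different chunks (0-99 and 100-199), and the lone span at 300 would win; the sliding window keeps the pair together.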

> Having a whole separate package trying to reverse-engineer where matches had
> taken place between Query and Document is hard to get right.

Exactly.

PS: Obviously, refinements of the highlighting algo will help Lucy, too. I
don't suppose you want to continue this on the Lucy dev list so that Lucy
banks some community credit for this discussion.  :\

> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream (general token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks at query boost to score fragments (currently doesn't look at idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...


[jira] Commented: (LUCENE-1522) another highlighter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682688#action_12682688 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
PS: Obviously, refinements of the highlighting algo will help Lucy, too. I
don't suppose you want to continue this on the Lucy dev list so that Lucy
banks some community credit for this discussion. :\
{quote}

Well... remember that more discussions between you and me and Nathan on
Lucy-dev (as much as I love having them) don't really "count" as a
"bigger" community.  In other words, like the scoring of a
BooleanQuery, there is a very strong coord factor at play when
measuring "community".  If you and I and Nathan have fewer
conversations on Lucy-dev, but then two other new people join in, that
is a much stronger community.

So, maybe send a note to lucy-dev, referencing this as a relevant
discussion to Lucy's approach to highlighting... and leave a
tantalizing invite here for others to make the jump to lucy-dev.
Growing a community is not easy!



[jira] Commented: (LUCENE-1522) another highlighter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682689#action_12682689 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
> Do you require term vectors to be stored, for highlighting (cannot
> re-analyze the text)?

Yes, but that's not fundamental to the design. You just have to hand the
Weight some sort of single-doc index that includes sufficient data to
determine what parts of the text contributed to the hit and how much they
contributed. The Weight needn't care whether that single-doc index was
created on the fly or stored at index time.
{quote}

OK.

{quote}
> For queries that normally do not use positions at all (simple AND/OR
> of terms), how does your highlightSpans() work?

ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
spans produced by their children.
{quote}

Hmm -- it seems like that loses information.  Ie, for ANDQuery, you
lose the fact that you should try to include a match from each of the
sub-clauses' spans.

{quote}
> For BooleanQuery, is coord factor used to favor fragment sets that
> include more unique terms?

No; I don't think that would be fine grained enough to help.
{quote}

What I meant was: all other things being equal, do you more strongly
favor a fragment that has all N of the terms in a query vs another
fragment that has fewer than N but, say, a higher net number of
occurrences?
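
(Illustration only: the preference asked about here, written as a comparison that ranks by distinct query terms first and uses total occurrences only as a tie-breaker. `CoverageFirst` and its methods are hypothetical names; fragments are assumed to be already tokenized.)

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CoverageFirst {
    /** Number of distinct query terms that occur in the fragment. */
    static int distinctMatches(List<String> fragmentTokens, Set<String> queryTerms) {
        Set<String> seen = new HashSet<>(fragmentTokens);
        seen.retainAll(queryTerms);
        return seen.size();
    }

    /** Total number of query-term occurrences in the fragment. */
    static int totalMatches(List<String> fragmentTokens, Set<String> queryTerms) {
        int count = 0;
        for (String token : fragmentTokens) {
            if (queryTerms.contains(token)) {
                count++;
            }
        }
        return count;
    }

    /** Positive if fragment a should be preferred over fragment b. */
    static int compare(List<String> a, List<String> b, Set<String> queryTerms) {
        int byCoverage = Integer.compare(distinctMatches(a, queryTerms),
                                         distinctMatches(b, queryTerms));
        if (byCoverage != 0) {
            return byCoverage;
        }
        return Integer.compare(totalMatches(a, queryTerms), totalMatches(b, queryTerms));
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(Arrays.asList("python", "greenspan"));
        List<String> both = Arrays.asList("python", "and", "greenspan");
        List<String> repeated = Arrays.asList("python", "python", "python");
        System.out.println(compare(both, repeated, query) > 0);
    }
}
```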

{quote}
There's a HeatMap class that performs additional weighting. Spans that
cluster together tightly (i.e. that could fit together within the excerpt) are
boosted.
{quote}

That sounds great.

{quote}
> Are you guaranteed to always present a net set of fragments that
> "matches" the query? (eg the example query above).

No. The KS version supplies a single fragment. It naturally prefers
fragments with rarer terms, because the span scores are multiplied by the
Weight's weighting factor (which includes IDF).
{quote}

Hmm OK.

{quote}
Once that fragment is selected, the KS highlighter worries a lot about
trimming to sensible sentence boundaries.
{quote}

I totally agree: easy/fast consumability is very important, so
choosing entire sentences, or at least anchoring the start or maybe
end on a sentence boundary, is important.  Lucene's H1 doesn't do this
ootb today I think (though you could plug in your own fragmenter).

{quote}
In my own subjective judgment, supplying a single maximally coherent fragment
which prefers clusters of rare terms results in an excerpt which "scans" as
quickly as possible, conveying the gist of the content with minimal "visual
effort". I used Google's excerpting as a model.
{quote}

Google doesn't always pick a single fragment; it seems like it picks
one or two fragments.

I'm torn on whether IDF should really come into play though...

{quote}
> I think the base litmus test for a highlighter is: if one were to
> take all fragments presented for a document (call this a "fragdoc")
> and make a new document from it, would that document match the
> original query?

Without the aid of formal studies to guide us, this is a subjective call.
FWIW, I disagree. In my view, visual scanning speed and coherence
are more important than completeness.

I'm not a big fan of the multi-fragment approach, because I think it takes too
much effort to grok each individual entry. Furthermore, the fact that the
fragments don't start on sentence boundaries (whenever feasible) adds to the
visual effort needed to orient yourself.

Search results contain a lot of junk. The user needs to be able to parse the
results page as quickly as possible and refine their search query as needed.
Noisy excerpts, with lots of ellipses and few sentences that can be "swallowed
whole" impede that. Trees vs. Forest.

Again, that's my own aesthetic judgment, but I'll wager that there are studies
out there showing that fragments which start at the top of a sentence are
easier to consume, and I think that's important.
{quote}

I agree, it's not cut and dried here; this is all quite subjective.

I think one case that's tricky is two terms that do not tend to
co-occur in proximity.  Eg search for python greenspan on Google, and
most of the fragdocs consist of two fragments, one for each term.  Ie
Google is trying to include all the terms in the fragdoc (my "coord
factor" question above).

{quote}
> In fact, I think the perfect highlighter would "logically" work as
> follows: take a single document and enumerate every single possible
> fragdoc.

KS uses a sliding window rather than chunking up the text into fragdocs of
fixed length.
{quote}

Or, the allowed length of each fragment could span a specified min/max
range.

And I like the sliding window approach instead of the pre-fragment
approach.

(Note: a fragdoc is one or more fragments stuck together, ie, the
entire excerpt.)
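
(Illustration only: the fragdoc litmus test for the simplest case, a pure AND of terms. `FragdocLitmus` is a hypothetical name, and the lowercase-and-split tokenization is an assumption standing in for a real analyzer; phrase and NOT clauses are not handled.)

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class FragdocLitmus {
    /** True if the fragments, taken together as one document, contain every required term. */
    static boolean fragdocMatchesAll(List<String> fragments, Set<String> requiredTerms) {
        Set<String> tokens = new HashSet<>();
        for (String fragment : fragments) {
            for (String token : fragment.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        return tokens.containsAll(requiredTerms);
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(Arrays.asList("python", "greenspan"));
        List<String> fragdoc = Arrays.asList("Python is a language", "Alan Greenspan said");
        System.out.println(fragdocMatchesAll(fragdoc, query));
    }
}
```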



[jira] Commented: (LUCENE-1522) another highlighter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682696#action_12682696 ]

Mark Miller commented on LUCENE-1522:
-------------------------------------

{quote}But, flattening could produce just a and d (I think?); and I think H1
could do the same even with SpanScorer (Mark is that true? I don't
fully understand the Query -> SpanQuery conversion). {quote}

Right - SpanScorer won't follow boolean logic - it will just break down each clause and not highlight a NOT - similar to standard H1. If a particular clause is position sensitive, it will only be 'lit if it's found in a valid position, but that's as deep as it goes.




[jira] Commented: (LUCENE-1522) another highlighter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682762#action_12682762 ]

Michael Busch commented on LUCENE-1522:
---------------------------------------

I wrote the highlighter for the OmniFind Yahoo Edition a few years ago
and I totally agree that all this stuff is very subjective.

The OYE highlighter is of course based on Lucene and uses a sliding
window too. It also uses information about sentence boundaries and
prefers fragments that start at the beginning of a sentence.

So it goes through the document and generates fragment candidates on
the fly. It calculates a score for each fragment and puts it into a
priority queue. The score is calculated using different heuristics:
- fragments that start at the beginning of a sentence are boosted
- the more highlighted terms a fragment contains, the higher it is
scored
- more different highlighted terms score higher than a lot of
occurrences of the same term
- no tf-idf is used
- if a fragment does not start at the beginning of a sentence, then it
is scored higher if the highlighted term(s) occur(s) more in the middle
of the fragment: e.g. 'a b c d e' scores lower than 'b c a d e' if 'a'
is the highlighted term; this is being done to show as much context as
possible around a highlighted term
- only a single long fragment is created if it contains all query terms
(like google)
- The queue tries to gather fragments, so that the union of the fragments
contains as many different query terms as possible. So it might toss a
fragment in favor of one with a lower score, if it increases the
total number of different highlighted terms.
- For performance reasons there is an early termination if the
fragments in the queue contain all query terms.
Initially this highlighter also imitated Lucene's behavior to find the
highlighted positions. Last year I changed it to use SpanQueries. With
our flexible query parser (which I introduced on java-dev recently) we
have two different QueryBuilders. One creates the "normal" query, that
is executed to find the matching docs. Then a different QueryBuilder
creates SpanQueries from the same query for the highlighter.

The output of the highlighter is not formatted html, but rather an
object containing the unformatted text, together with offset
information for both fragments and highlights. These offset spans can
carry additional information, which can be used for multi-color
highlighting too. We then use an HTMLFormatter class to generate the
formatted text. An XMLFormatter that keeps the offset information
separate from the actual text is also possible (we're currently working
on such an XMLFormatter). This is useful for frontends written in e.g. Flex.
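
(Illustration only: the separation of unformatted text plus offsets from the formatting step. `HighlightedFragment` and `SimpleHtmlFormatter` are hypothetical names, not the OYE classes; highlights are assumed to be sorted, non-overlapping {start, end} pairs relative to the fragment text.)

```java
/** Hypothetical highlighter output: raw text plus highlight offsets, no markup. */
final class HighlightedFragment {
    final String text;
    final int[][] highlights; // {start, end} pairs, sorted and non-overlapping

    HighlightedFragment(String text, int[][] highlights) {
        this.text = text;
        this.highlights = highlights;
    }
}

final class SimpleHtmlFormatter {
    /** Wraps each highlighted range in <b> tags and escapes everything else. */
    static String format(HighlightedFragment fragment) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        for (int[] h : fragment.highlights) {
            sb.append(escape(fragment.text.substring(pos, h[0])));
            sb.append("<b>").append(escape(fragment.text.substring(h[0], h[1]))).append("</b>");
            pos = h[1];
        }
        sb.append(escape(fragment.text.substring(pos)));
        return sb.toString();
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        HighlightedFragment f =
            new HighlightedFragment("w1 w3 w2", new int[][] {{0, 2}, {6, 8}});
        System.out.println(format(f));
    }
}
```

Because the offsets travel separately from the text, a different formatter (XML, or a frontend that styles ranges itself) can consume the same object without re-running the highlighter.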

The performance of our highlighter is good and so far we have been
pretty happy with the quality of the excerpts, but there is still much
room for improvement.

I'd be happy to help work on a new highlighter. I think this is a
very important component, and Lucene's core should have a very good
and flexible one.


[jira] Issue Comment Edited: (LUCENE-1522) another highlighter

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682762#action_12682762 ]

Michael Busch edited comment on LUCENE-1522 at 3/17/09 12:10 PM:
-----------------------------------------------------------------

I wrote the highlighter for the OmniFind Yahoo Edition a few years ago
and I totally agree that all this stuff is very subjective.

The OYE highlighter is of course based on Lucene and uses a sliding
window too. It also uses information about sentence boundaries and
prefers fragments that start at the beginning of a sentence.

So it goes through the document and generates fragment candidates on
the fly. It calculates a score for each fragment and puts it into a
priority queue. The score is calculated using different heuristics:
- fragments that start at the beginning of a sentence are boosted
- the more highlighted terms a fragment contains, the higher it is
scored
- more different highlighted terms score higher than a lot of
occurrences of the same term
- no tf-idf is used
- if a fragment does not start at the beginning of a sentence, then it
is scored higher if the highlighted term(s) occur(s) more in the middle
of the fragment: e.g. 'a b c d e' scores lower than 'b c a d e' if 'a'
is the highlighted term; this is being done to show as much context as
possible around a highlighted term
- only a single long fragment is created if it contains all query terms
(like google)
- The queue tries to gather fragments, so that the union of the fragments
contains as many different query terms as possible. So it might toss a
fragment in favor of one with a lower score, if it increases the
total number of different highlighted terms.
- For performance reasons there is an early termination if the
fragments in the queue contain all query terms.
Initially this highlighter also imitated Lucene's behavior to find the
highlighted positions. Last year I changed it to use SpanQueries. With
our flexible query parser (which I introduced on java-dev recently) we
have two different QueryBuilders. One creates the "normal" query, that
is executed to find the matching docs. Then a different QueryBuilder
creates SpanQueries from the same query for the highlighter.

The output of the highlighter is not formatted html, but rather an
object containing the unformatted text, together with offset
information for both fragments and highlights. These offset spans can
carry additional information, which can be used for multi-color
highlighting too. We then use an HTMLFormatter class to generate the
formatted text. An XMLFormatter that keeps the offset information
separate from the actual text is also possible (we're currently working
on such an XMLFormatter). This is useful for frontends written in e.g. Flex.

The performance of our highlighter is good and so far we have been
pretty happy with the quality of the excerpts, but there is still much
room for improvement.

I'd be happy to help work on a new highlighter. I think this is a
very important component, and Lucene's core should have a very good
and flexible one.


[jira] Commented: (LUCENE-1522) another highlighter


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682777#action_12682777 ]

Marvin Humphrey commented on LUCENE-1522:
-----------------------------------------

>> ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
>> spans produced by their children.
>
> Hmm - it seems like that loses information.  Ie, for ANDQuery, you lose the
> fact that you should try to include a match from each of the sub-clauses' spans.

A good idea.  ANDQuery's highlightSpans() method could probably be improved by
post-processing the child spans to take this into account.  That way we
wouldn't have to gum up the main Highlighter code with a bunch of conditionals
which afford special treatment to certain query types.

> What I meant was: all other things being equal, do you more strongly
> favor a fragment that has all N of the terms in a query vs another
> fragment that has fewer than N but say higher net number of occurrences.

No, the diversity of the terms in a fragment isn't factored in.  The span
objects only tell the Highlighter that a particular range of characters
was important; they don't say why.

However, note that IDF would prevent a bunch of hits on "the" from causing too
hot a hotspot in the heat map.  So you're likely to see fragments with high
discriminatory value.
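
To make the heat map idea concrete, here is a minimal sketch. It is not
the KS highlighter's code: the class name HeatMapSketch, the Span type
and the weights are all hypothetical. Each span carries only offsets
and a weight (which could be IDF-derived), so the code cannot tell
which term produced it, matching the limitation described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: pick the hottest fixed-width window of a document. */
final class HeatMapSketch {

    /** A character range plus a weight; it does not record why the range matters. */
    static final class Span {
        final int start;      // inclusive character offset
        final int end;        // exclusive character offset
        final double weight;  // e.g. an IDF-derived score (assumption)

        Span(int start, int end, double weight) {
            this.start = start;
            this.end = end;
            this.weight = weight;
        }
    }

    /** Returns the start offset of the window with the highest summed heat. */
    static int hottestWindowStart(List<Span> spans, int docLength, int width) {
        // Accumulate per-character heat from every span.
        double[] heat = new double[docLength];
        for (Span s : spans) {
            for (int i = Math.max(0, s.start); i < Math.min(docLength, s.end); i++) {
                heat[i] += s.weight;
            }
        }
        // Prefix sums let us evaluate each window in constant time.
        double[] prefix = new double[docLength + 1];
        for (int i = 0; i < docLength; i++) {
            prefix[i + 1] = prefix[i] + heat[i];
        }
        int w = Math.min(width, docLength);
        int bestStart = 0;
        double best = Double.NEGATIVE_INFINITY;
        for (int start = 0; start + w <= docLength; start++) {
            double sum = prefix[start + w] - prefix[start];
            if (sum > best) {   // strict comparison keeps the earliest best window
                best = sum;
                bestStart = start;
            }
        }
        return bestStart;
    }
}
```

With low weights on common terms and a high weight on a rare one, a few
hits on "the" cannot outscore a single hit on the rare term.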

> Google picks more than one fragment; it seems like it picks one or two
> fragments.

I probably overstated my opposition to supplying an excerpt containing more
than one fragment.  It seems OK to me to select more than one, so long as they
all scan easily, and so long as the excerpts don't get long enough to force
excessive scrolling and slow down the time it takes the user to scan the whole
results page.  

What bothers me is that the excerpts don't scan easily right now.  I consider
that a much more important defect than the fact that the fragdoc doesn't hit
every term (which isn't even possible for large queries), and it seemed to me
that pursuing exhaustive term matching was likely to yield even more highly
fragmented, visually chaotic fragdocs.  


[jira] Commented: (LUCENE-1522) another highlighter


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682985#action_12682985 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------


{quote}
>> ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
>> spans produced by their children.
>
> Hmm - it seems like that loses information. Ie, for ANDQuery, you lose the
> fact that you should try to include a match from each of the sub-clauses' spans.

A good idea. ANDQuery's highlightSpans() method could probably be improved by
post-processing the child spans to take this into account. That way we
wouldn't have to gum up the main Highlighter code with a bunch of conditionals
which afford special treatment to certain query types.
{quote}

I think we may need a tree-structured result returned by the
Weight/Scorer, compactly representing the "space" of valid fragdocs
for this one doc.  And then somehow we walk that tree,
enumerating/scoring individual "valid" fragdocs that are created from
that tree.

{quote}
> What I meant was: all other things being equal, do you more strongly
> favor a fragment that has all N of the terms in a query vs another
> fragment that has fewer than N but say higher net number of occurrences.

No, the diversity of the terms in a fragment isn't factored in. The span
objects only tell the Highlighter that a particular range of characters
was important; they don't say why.

However, note that IDF would prevent a bunch of hits on "the" from causing too
hot a hotspot in the heat map. So you're likely to see fragments with high
discriminatory value.
{quote}

This still seems subjectively wrong to me.  If I search for "president
bush", probably bush is the rarer term and so you would favor showing
me a single fragment that had bush occur twice, over a fragment that
had a single occurrence of president and bush?

{quote}
> Google picks more than one fragment; it seems like it picks one or two
> fragments.

I probably overstated my opposition to supplying an excerpt containing more
than one fragment. It seems OK to me to select more than one, so long as they
all scan easily, and so long as the excerpts don't get long enough to force
excessive scrolling and slow down the time it takes the user to scan the whole
results page.

What bothers me is that the excerpts don't scan easily right now. I consider
that a much more important defect than the fact that the fragdoc doesn't hit
every term (which isn't even possible for large queries), and it seemed to me
that pursuing exhaustive term matching was likely to yield even more highly
fragmented, visually chaotic fragdocs.
{quote}

Which excerpts don't scan easily right now?  Google's, KS's, Lucene's
H1 or H2?

I think with a tree structure representing the search space for all
fragdocs, we could then efficiently enumerate fragdocs with an
appropriate scoring model (favoring sentence starts or surrounding
context, breadth of terms, etc.).  This way we can do a real search
(on all fragdocs) subject to the preference for
consumability/breadth.



[jira] Commented: (LUCENE-1522) another highlighter


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682987#action_12682987 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------


OK to sum up here with observations / wish list / ideas /
controversies / etc. for Lucene's future merged highlighter:

  * Fragmenter should aim for fast "eye + brain scanning
    consumability" (eg, try hard to start on sentence boundaries,
    include context)

  * Let's try for single source -- each Query/Weight/Scorer should be
    able to enumerate the set of term positions/spans that caused it
    to match a specific doc (like explain(), but provides
    positions/spans detailing the match).  Trying to "reverse
    engineer" the matching is brittle

  * Sliding window is better than static "top down" fragmentation

  * To scale, we should make a simple IndexReader impl on top of term
    vectors, but still allow the "re-index single doc on the fly"
    option

  * Favoring breadth (more unique terms instead of many occurrences of
    certain terms) seems important, except for too-many-term queries
    where this gets unwieldy

  * Prefer a single fragment if it scores well enough, but fall back
    to several, if necessary, to show "breadth"

  * Produce structured output so non-HTML front ends (eg Flex) can
    render

  * Try to include "context around the hits", when possible (eg the
    "favor middle of the sentence" that Michael described)

  * Maybe let IDF affect fragment scoring, maybe not

  * Performance is important -- use TermVectors if present, add early
    termination if you've already found a good enough fragdoc, etc.

  * Maybe a tree-based fragdoc enumeration / searching model; I think
    this'd be even more efficient than sliding window, especially for
    large docs

  * Multi-color, HeatMap default ootb HTML UIs are nice

  * It's all very subjective and quite a good challenge!!
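
Two of the items above, the sliding window and the preference for
breadth, can be sketched together. This is only an illustration under
stated assumptions: the class SlidingWindowSketch, the Hit type and the
rule "most distinct terms first, then most hits" are hypothetical and
come from neither H1 nor H2.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: slide a window over term hits and favor term breadth. */
final class SlidingWindowSketch {

    /** One occurrence of a query term, with character offsets. */
    static final class Hit {
        final String term;
        final int start;  // inclusive
        final int end;    // exclusive

        Hit(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns {start, end} of the best window, or null if there are no hits. */
    static int[] bestWindow(List<Hit> hits, int fragCharSize) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Hit h) -> h.start));
        Map<String, Integer> counts = new HashMap<>();  // term -> hits inside window
        int left = 0;
        int bestDistinct = 0;
        int bestHits = 0;
        int[] best = null;
        for (int right = 0; right < sorted.size(); right++) {
            counts.merge(sorted.get(right).term, 1, Integer::sum);
            // Drop hits on the left until the window fits the fragment size.
            while (left < right
                    && sorted.get(right).end - sorted.get(left).start > fragCharSize) {
                Hit gone = sorted.get(left++);
                if (counts.merge(gone.term, -1, Integer::sum) == 0) {
                    counts.remove(gone.term);
                }
            }
            int distinct = counts.size();
            int n = right - left + 1;
            // Breadth first; the number of occurrences only breaks ties.
            if (distinct > bestDistinct || (distinct == bestDistinct && n > bestHits)) {
                bestDistinct = distinct;
                bestHits = n;
                best = new int[] { sorted.get(left).start, sorted.get(right).end };
            }
        }
        return best;
    }
}
```

For a query like "president bush", this rule prefers a window holding
one "president" and one "bush" over a window holding "bush" twice.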

In the meantime, it seems like we should commit this H2 and give users
the choice?  We can then iterate over time on our wish list....



[jira] Commented: (LUCENE-1522) another highlighter


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683030#action_12683030 ]

Marvin Humphrey commented on LUCENE-1522:
-----------------------------------------

> I think we may need a tree-structured result returned by the
> Weight/Scorer, compactly representing the "space" of valid fragdocs
> for this one doc. And then somehow we walk that tree,
> enumerating/scoring individual "valid" fragdocs that are created from
> that tree.

Something like that.  An array of span scores is too limited; a full fledged
class would do better.  Designing that class requires striking a balance
between what information we think is useful and what information Highlighter
can sanely reduce.  By proposing the tree structure, you're suggesting that
Highlighter will reverse engineer boolean matching; that sounds like a lot of
work to me.  

>> However, note that IDF would prevent a bunch of hits on "the" from causing too
>> hot a hotspot in the heat map. So you're likely to see fragments with high
>> discriminatory value.
>
> This still seems subjectively wrong to me. If I search for "president
> bush", probably bush is the rarer term and so you would favor showing
> me a single fragment that had bush occur twice, over a fragment that
> had a single occurrence of president and bush?

We've ended up in a false dichotomy.  Favoring high IDF terms -- or more
accurately, high scoring character position spans -- and favoring fragments
with high term diversity are not mutually exclusive.  

Still, the KS highlighter probably wouldn't do what you describe.  The proximity
boosting accelerates as the spans approach each other, and maxes out if
they're adjacent.  So "bush bush" might be preferred over "president bush",
but "bush or bush" probably wouldn't.
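
As an illustration of a boost that accelerates as spans approach each
other and maxes out when they touch, here is a minimal sketch. The
quadratic curve, the class name ProximityBoostSketch and the
maxGapChars parameter are assumptions made for this example; the
actual KS formula is not given in this thread.

```java
/** Hypothetical sketch of a proximity boost between two spans. */
final class ProximityBoostSketch {

    /** Gap in characters between the end of one span and the start of the next. */
    static int gap(int endA, int startB) {
        return Math.max(0, startB - endA);
    }

    /**
     * Returns 1.0 for touching spans and 0.0 once the gap reaches maxGapChars.
     * Squaring makes the boost rise faster the closer the spans get.
     */
    static double boost(int gapChars, int maxGapChars) {
        if (maxGapChars <= 0 || gapChars >= maxGapChars) {
            return 0.0;
        }
        int g = Math.max(0, gapChars);
        double closeness = 1.0 - (double) g / maxGapChars;
        return closeness * closeness;
    }
}
```

A fragment score could then add each span's own weight to the boost for
every neighboring pair, so proximity and term weight both contribute.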

I don't think that there's anything wrong with preferring high term diversity;
the KS highlighter doesn't happen to support favoring fragments with high term
diversity now, but would be improved by adding that capability.  I just don't
think term diversity is so important that it qualifies as a "base litmus
test".

There are other ways of choosing good fragments, and IDF is one of them.  If
you want to show why a doc matched a query, it makes sense to show the section
of the document that contributed most to the score, surrounded by a little
context.  

> Which excerpts don't scan easily right now? Google's, KS's, Lucene's
> H1 or H2?

Lucene H1.  Too many ellipses, and fragments don't prefer to start on sentence
boundaries.

I have to qualify the assertion that the fragments don't scan well with the caveat
that I'm basing this on a personal impression.  However, I'm pretty confident
about that impression.  I would be stunned if there were not studies out there
demonstrating that sentence fragments which begin at the top are easier to
consume than sentence fragments which begin in the middle.


[jira] Commented: (LUCENE-1522) another highlighter


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683032#action_12683032 ]

Mark Miller commented on LUCENE-1522:
-------------------------------------

bq. Lucene H1. Too many ellipses, and fragments don't prefer to start on sentence boundaries.

That's not necessarily a property of the Highlighter, just the basic implementations we currently supply for the pluggable classes. You can supply a custom fragmenter and you can control the number of fragments.


[jira] Commented: (LUCENE-1522) another highlighter


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683053#action_12683053 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------


{quote}
Something like that. An array of span scores is too limited; a full fledged
class would do better. Designing that class requires striking a balance
between what information we think is useful and what information Highlighter
can sanely reduce.
{quote}

Agreed, and I'm not sure about the tree structure (just floating
ideas...).  It could very well be overkill.

{quote}
By proposing the tree structure, you're suggesting that
Highlighter will reverse engineer boolean matching; that sounds like a lot of
work to me.
{quote}

It wouldn't be reverse engineered: BooleanQuery/Weight/Scorer2 itself
will have returned that.  I.e., we would add a method such as
"getSpanTree()" to them.

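To show what such a tree might carry, here is a minimal sketch. Nothing
in it is an existing Lucene API: getSpanTree() is only floated in this
thread, and the SpanTreeSketch class, its Node interface and the
term/and/or factories are hypothetical names. Each node answers one
question: does a candidate fragment contain enough spans to explain the
match under that node?

```java
/** Hypothetical sketch of a span tree that a boolean query could return. */
final class SpanTreeSketch {

    /** A node decides whether the fragment [fragStart, fragEnd) satisfies it. */
    interface Node {
        boolean satisfiedBy(int fragStart, int fragEnd);
    }

    /** Leaf: satisfied if any {start, end} span lies fully inside the fragment. */
    static Node term(final int[]... spans) {
        return (s, e) -> {
            for (int[] span : spans) {
                if (span[0] >= s && span[1] <= e) {
                    return true;
                }
            }
            return false;
        };
    }

    /** AND: every child must be satisfied by the same fragment. */
    static Node and(final Node... children) {
        return (s, e) -> {
            for (Node c : children) {
                if (!c.satisfiedBy(s, e)) {
                    return false;
                }
            }
            return true;
        };
    }

    /** OR: one satisfied child is enough. */
    static Node or(final Node... children) {
        return (s, e) -> {
            for (Node c : children) {
                if (c.satisfiedBy(s, e)) {
                    return true;
                }
            }
            return false;
        };
    }
}
```

A highlighter walking candidate fragments could then ask the root node
whether a fragment is "valid" without knowing which query types
produced the tree.
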
{quote}
Still, the KS highlighter probably wouldn't do what you describe.  The proximity
boosting accelerates as the spans approach each other, and maxes out if
they're adjacent.  So "bush bush" might be preferred over "president bush",
but "bush or bush" probably wouldn't.
{quote}

OK, it sounds like one can simply use different models to score
fragdocs and it's still an open debate how much each of these criteria
(IDF, showing surround context, being on sentence boundary, diversity
of terms) should impact the score.  I agree, the "basic litmus test" I
proposed is too strong.

{quote}
bq. Lucene H1. Too many ellipses, and fragments don't prefer to start on sentence boundaries.

That's not necessarily a property of the Highlighter, just the basic
implementations we currently supply for the pluggable classes. You can
supply a custom fragmenter and you can control the number of
fragments.
{quote}

I agree: H1 is very pluggable and one could plug in a better
fragmenter, but we don't offer such an impl in H1, and this is a case
where "out-of-the-box defaults" are very important.



[jira] Commented: (LUCENE-1522) another highlighter


    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683064#action_12683064 ]

Marvin Humphrey commented on LUCENE-1522:
-----------------------------------------

> OK, it sounds like one can simply use different models to score
> fragdocs and it's still an open debate how much each of these criteria
> (IDF, showing surround context, being on sentence boundary, diversity
> of terms) should impact the score.

With Michael Busch's priority queue approach, the algorithm for choosing the
fragments can be abstracted into the class of object we put in the queue and
its lessThan() method.  The output from the queue just has to be something the
Highlighter can chew.
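
A minimal sketch of that idea follows. To stay self-contained it uses
java.util.PriorityQueue with a Comparator in place of a lessThan()
override; the FragmentQueueSketch class and its Fragment type are
hypothetical. Swapping the Comparator changes the fragment selection
policy without touching the code that consumes the queue.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: keep the top-N fragments under a pluggable ordering. */
final class FragmentQueueSketch {

    /** A candidate fragment with character offsets and some score. */
    static final class Fragment {
        final int start;
        final int end;
        final double score;

        Fragment(int start, int end, double score) {
            this.start = start;
            this.end = end;
            this.score = score;
        }
    }

    /** Returns the n best fragments, best first, according to the given order. */
    static List<Fragment> topFragments(List<Fragment> candidates, int n,
                                       Comparator<Fragment> order) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the head is always the weakest fragment kept so far.
        PriorityQueue<Fragment> queue = new PriorityQueue<>(n, order);
        for (Fragment f : candidates) {
            if (queue.size() < n) {
                queue.add(f);
            } else if (order.compare(queue.peek(), f) < 0) {
                queue.poll();   // evict the weakest
                queue.add(f);
            }
        }
        List<Fragment> result = new ArrayList<>(queue);
        result.sort(order.reversed());
        return result;
    }
}
```

An ordering by raw score, by term diversity, or by "starts on a
sentence boundary" would each be just a different Comparator here.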
