[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682492#action_12682492 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
I'd also like it to work if you don't have termvectors stored (though be faster if they are perhaps, as it is now).
{quote}

I agree.

{quote}
Getting hit positions for position sensitive clauses requires converting the query to a span query and calling getSpans on a memory index
{quote}

Is the reason why H1 creates the full token stream (even when TermVectors is the source) in order to build the MemoryIndex?

If term vectors (w/ positions, offsets) were stored, wouldn't it be possible to make a simple index (or at least TermDocs, TermPositions) wrapped on those TermVectors?

> another highlighter
> -------------------
>
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/highlighter
> Reporter: Koji Sekiguchi
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream (general token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
>
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
>
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks query boost to score fragments (currently doesn't see idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
>
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...
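The question above, whether stored term vectors could be wrapped as a simple per-document index, can be pictured with a small sketch. This is only an illustration of the data shape, assuming positions and offsets are already available per term; the names `TermVectorPostingsSketch`, `Occurrence`, `add`, and `occurrences` are hypothetical and are not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-document postings built from stored term vector data.
// None of these names are Lucene APIs.
final class TermVectorPostingsSketch {
    // One occurrence of a term: token position plus character offsets.
    static final class Occurrence {
        final int position;
        final int startOffset;
        final int endOffset;

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    private final Map<String, List<Occurrence>> postings = new HashMap<>();

    // Record one term occurrence as it might be read from a stored term vector.
    void add(String term, int position, int startOffset, int endOffset) {
        postings.computeIfAbsent(term, k -> new ArrayList<>())
                .add(new Occurrence(position, startOffset, endOffset));
    }

    // Enumerate the occurrences of a term within this one document.
    List<Occurrence> occurrences(String term) {
        return postings.getOrDefault(term, Collections.<Occurrence>emptyList());
    }

    public static void main(String[] args) {
        // Text "w1 w3 w2": offsets are [start, end) character ranges.
        TermVectorPostingsSketch doc = new TermVectorPostingsSketch();
        doc.add("w1", 0, 0, 2);
        doc.add("w3", 1, 3, 5);
        doc.add("w2", 2, 6, 8);
        for (Occurrence o : doc.occurrences("w2")) {
            System.out.println(o.position + " " + o.startOffset + "-" + o.endOffset);
        }
    }
}
```

A highlighter built on such a lookup would not need to re-analyze the text, which is the performance argument made in the comment.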
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682494#action_12682494 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
But, if you have for example a document 'a b c a b c' and the query 'a AND b', then this approach would only highlight the first two terms, no?
{quote}

Ahh right -- in fact, nothing would be highlighted because the scorer for AND queries doesn't visit positions at all (it doesn't need to).

I guess we'd have to ask such scorers to forcefully visit positions & enumerate all matches within one doc, when running in "highlight" mode. Hmm, feeling like a big change...

But maybe it could work. It'd be sort of like a positional-aware "explain", ie "show me the term occurrences that allowed the full query to accept this document".

Imagine query "(a AND b) OR (c AND d)". When looking at the fragments for each doc, I would want to see both a AND b, or both c AND d, but never (for example) just a and d. But, flattening could produce just a and d (I think?); and I think H1 could do the same even with SpanScorer (Mark is that true? I don't fully understand the Query -> SpanQuery conversion). Whereas if we could ask for positions of the "real" matches I think it would work correctly?
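The difference between flattening and asking for the "real" matches in the "(a AND b) OR (c AND d)" example can be sketched with a toy query tree. This is a minimal sketch under stated assumptions: the `Node` interface and the `term`, `and`, `or`, and `flattened` helpers are hypothetical, and they do not reflect how Lucene's Query, Weight, or Scorer classes work.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy query tree; hypothetical names, not Lucene's Query/Scorer API.
final class MatchPositionsSketch {
    // A node reports the token positions that let it accept the document,
    // or an empty set if it does not match.
    interface Node {
        SortedSet<Integer> matches(List<String> doc);
    }

    static Node term(String t) {
        return doc -> {
            SortedSet<Integer> out = new TreeSet<Integer>();
            for (int i = 0; i < doc.size(); i++) {
                if (doc.get(i).equals(t)) out.add(i);
            }
            return out;
        };
    }

    // AND: every child must match; otherwise nothing is reported.
    static Node and(Node... children) {
        return doc -> {
            SortedSet<Integer> out = new TreeSet<Integer>();
            for (Node c : children) {
                SortedSet<Integer> m = c.matches(doc);
                if (m.isEmpty()) return new TreeSet<Integer>();
                out.addAll(m);
            }
            return out;
        };
    }

    // OR: union of the positions from whichever children matched.
    static Node or(Node... children) {
        return doc -> {
            SortedSet<Integer> out = new TreeSet<Integer>();
            for (Node c : children) out.addAll(c.matches(doc));
            return out;
        };
    }

    // Flattening: every occurrence of any query term, ignoring query structure.
    static SortedSet<Integer> flattened(List<String> doc, String... terms) {
        SortedSet<Integer> out = new TreeSet<Integer>();
        for (String t : terms) out.addAll(term(t).matches(doc));
        return out;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("a", "b", "x", "d");
        Node q = or(and(term("a"), term("b")), and(term("c"), term("d")));
        System.out.println(q.matches(doc));                     // positions of a and b only
        System.out.println(flattened(doc, "a", "b", "c", "d")); // also includes the stray d
    }
}
```

For the document "a b x d", the tree reports only the positions of a and b, while flattening also reports d, which is the stray term that could end up in a fragment next to a.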
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682502#action_12682502 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

bq. Not sure it solves being able to gets offsets from the query terms and still mask for positions though

Can you explain that more?
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682506#action_12682506 ]

Mark Miller commented on LUCENE-1522:
-------------------------------------

{quote}
Is the reason why H1 creates the full token stream (even when TermVectors is the source) in order to build the MemoryIndex? If term vectors (w/ positions, offsets) were stored, wouldn't it be possible to make a simple index (or at least TermDocs, TermPositions) wrapped on those TermVectors?
{quote}

It creates the full tokenstream because it was designed to work without termvectors, and so without offset info for the query terms, it rebuilds the stream and processes a token at a time - the api gives you hooks to highlight at any of these tokens - that's essentially the bottleneck I think - taking everything a token at a time, but the whole API is based on that fact.

With the SpanScorer version, we can get almost any info from the MemoryIndex, but it was convenient to fit into the current highlighter API to start. I had it in my mind to break from the API and make a largedoc highlighter that didn't need termvectors, but I found the memory index and getspans to still be too slow in my initial testing. I'd hoped to work more on it, but haven't had a chance.

So essentially, while more can be done with termvectors, the improvements break the current API at a pretty deep level - no one has done the work to solve that I guess - which is why we have the alternate highlighters.
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682515#action_12682515 ]

Marvin Humphrey commented on LUCENE-1522:
-----------------------------------------

> It'd be sort of like a positional-aware "explain", ie "show me the term
> occurrences that allowed the full query to accept this document".

FWIW, this is more or less how the KinoSearch highlighter now works in svn trunk. It doesn't use a Scorer, though, but instead the KS analogue to Lucene's "Weight" class. The (Weight) is fed what is essentially a single doc index, using stored term vectors. Weight.highlightSpans() returns an array of "span" objects, each of which has a start offset, a length, and a score. The Highlighter then processes these span objects to create a "heat map" and choose its excerpt points.

The idea is that by delegating responsibility for creating the scoring spans, we make it easier to support arbitrary Query implementations with a single Highlighter class.
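The span and heat map idea described above can be sketched roughly as follows. This is not the KinoSearch implementation; the `Span` class, `heatMap`, and `hottestWindowStart` are hypothetical names, and the per-character accumulation plus fixed-size window is an assumed simplification of "create a heat map and choose excerpt points".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of spans feeding a heat map; not KinoSearch code.
final class HeatMapSketch {
    // A scoring span as described: start offset, length, and score.
    static final class Span {
        final int offset;
        final int length;
        final float score;

        Span(int offset, int length, float score) {
            this.offset = offset;
            this.length = length;
            this.score = score;
        }
    }

    // Accumulate span scores per character so that overlapping spans add up.
    static float[] heatMap(int textLength, List<Span> spans) {
        float[] heat = new float[textLength];
        for (Span s : spans) {
            int end = Math.min(textLength, s.offset + s.length);
            for (int i = Math.max(0, s.offset); i < end; i++) {
                heat[i] += s.score;
            }
        }
        return heat;
    }

    // Start offset of the excerpt-sized window with the most total heat.
    // Ties keep the earliest window.
    static int hottestWindowStart(float[] heat, int window) {
        int best = 0;
        float bestSum = -1f;
        float sum = 0f;
        for (int i = 0; i < heat.length; i++) {
            sum += heat[i];
            if (i >= window) sum -= heat[i - window];
            if (sum > bestSum) {
                bestSum = sum;
                best = Math.max(0, i - window + 1);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(
                new Span(2, 3, 1.0f), new Span(10, 2, 2.0f), new Span(13, 2, 2.0f));
        float[] heat = heatMap(20, spans);
        System.out.println(hottestWindowStart(heat, 5));
    }
}
```

Because the highlighter only consumes spans, each query type can decide for itself which parts of the text count as scoring, which is the delegation argument made in the comment.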
|
|
[jira] Issue Comment Edited: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682506#action_12682506 ]

Mark Miller edited comment on LUCENE-1522 at 3/16/09 4:43 PM:
--------------------------------------------------------------

{quote}
Is the reason why H1 creates the full token stream (even when TermVectors is the source) in order to build the MemoryIndex? If term vectors (w/ positions, offsets) were stored, wouldn't it be possible to make a simple index (or at least TermDocs, TermPositions) wrapped on those TermVectors?
{quote}

It creates the full tokenstream because it was designed to work without termvectors, and so without offset info for the query terms, it rebuilds the stream and processes a token at a time - the api gives you hooks to highlight at any of these tokens - that's essentially the bottleneck I think - taking everything a token at a time, but the whole API is based on that fact.

With the SpanScorer version, we can get almost any info from the MemoryIndex, but it was convenient to fit into the current highlighter API to start. I had it in my mind to break from the API and make a largedoc highlighter that didn't need termvectors, but I found the memory index and getspans to still be too slow in my initial testing. I'd hoped to work more on it, but haven't had a chance.

So essentially, while more can be done with termvectors, the improvements break the current API at a pretty deep level - no one has done the work to solve that I guess - which is why we have the alternate highlighters.

*edit*

I suppose one of the main problems with the large doc approach I briefly tested is that it still requires that you rebuild the tokenstream (and I was attempting to not use termvectors either). Avoiding the need for that would probably make it much more competitive.
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682609#action_12682609 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
> It'd be sort of like a positional-aware "explain", ie "show me the term
> occurrences that allowed the full query to accept this document".

FWIW, this is more or less how the KinoSearch highlighter now works in svn trunk. It doesn't use a Scorer, though, but instead the KS analogue to Lucene's "Weight" class. The (Weight) is fed what is essentially a single doc index, using stored term vectors. Weight.highlightSpans() returns an array of "span" objects, each of which has a start offset, a length, and a score. The Highlighter then processes these span objects to create a "heat map" and choose its excerpt points.

The idea is that by delegating responsibility for creating the scoring spans, we make it easier to support arbitrary Query implementations with a single Highlighter class.
{quote}

Awesome!

Do you require term vectors to be stored, for highlighting (cannot re-analyze the text)?

For queries that normally do not use positions at all (simple AND/OR of terms), how does your highlightSpans() work?

For BooleanQuery, is coord factor used to favor fragment sets that include more unique terms?

Are you guaranteed to always present a net set of fragments that "matches" the query? (eg the example query above).

I think the base litmus test for a highlighter is: if one were to take all fragments presented for a document (call this a "fragdoc") and make a new document from it, would that document match the original query?

In fact, I think the perfect highlighter would "logically" work as follows: take a single document and enumerate every single possible fragdoc. Each fragdoc is allowed to have maxNumFragments fragments, where each fragment has a min/max number of characters. The set of fragdocs is of course ridiculously immense. Take this massive collection of fragdocs and build a new temporary index, then run your Query against that index. Many of the fragdocs would not match the Query, so they are eliminated right off (this is the litmus test). Then, of the ones that do, you want the highest scoring fragdocs.

Obviously you can't actually implement a highlighter like that, but I think "logically" that is the optimal highlighter that we are trying to emulate with more efficient implementations.

I think having the Query/Weight/Scorer class be the single-source for hits, explanation & highlight spans is the right approach. Having a whole separate package trying to reverse-engineer where matches had taken place between Query and Document is hard to get right. EG BooleanScorer2's coord factor would naturally/correctly influence the selection.

I also think building a [reduced, just Postings] IndexReader API on top of TermVectors ought to be a simple way to get great performance here.
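The litmus test proposed above can be written down as a tiny check. This is a minimal sketch, assuming a purely Boolean term query modeled as a predicate over the set of terms in the fragdoc; `FragdocLitmusSketch` and `passesLitmus` are hypothetical names, and positional queries such as phrases are outside what this simplification covers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of the "fragdoc" litmus test; not a Lucene API.
final class FragdocLitmusSketch {
    // Join the fragments into one "fragdoc" and ask whether it still matches.
    // Assumption: the query is Boolean over terms, so a term set is enough.
    static boolean passesLitmus(List<String> fragments, Predicate<Set<String>> query) {
        Set<String> terms = new HashSet<>();
        for (String fragment : fragments) {
            for (String token : fragment.trim().split("\\s+")) {
                if (!token.isEmpty()) terms.add(token);
            }
        }
        return query.test(terms);
    }

    public static void main(String[] args) {
        // The example query from the thread: (a AND b) OR (c AND d).
        Predicate<Set<String>> query = t ->
                (t.contains("a") && t.contains("b")) || (t.contains("c") && t.contains("d"));
        System.out.println(passesLitmus(Arrays.asList("a x", "y d"), query)); // a and d only
        System.out.println(passesLitmus(Arrays.asList("a x", "b y"), query)); // a and b
    }
}
```

A fragment set showing only a and d fails the check, while one showing a and b passes, matching the expectation stated for the example query.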
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682672#action_12682672 ]

Marvin Humphrey commented on LUCENE-1522:
-----------------------------------------

> Do you require term vectors to be stored, for highlighting (cannot
> re-analyze the text)?

Yes, but that's not fundamental to the design. You just have to hand the Weight some sort of single-doc index that includes sufficient data to determine what parts of the text contributed to the hit and how much they contributed. The Weight needn't care whether that single-doc index was created on the fly or stored at index time.

> For queries that normally do not use positions at all (simple AND/OR
> of terms), how does your highlightSpans() work?

ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the spans produced by their children.

> For BooleanQuery, is coord factor used to favor fragment sets that
> include more unique terms?

No; I don't think that would be fine grained enough to help. There's a HeatMap class that performs additional weighting. Spans that cluster together tightly (i.e. that could fit together within the excerpt) are boosted.

> Are you guaranteed to always present a net set of fragments that
> "matches" the query? (eg the example query above).

No. The KS version supplies a single fragment. It naturally prefers fragments with rarer terms, because the span scores are multiplied by the Weight's weighting factor (which includes IDF).

Once that fragment is selected, the KS highlighter worries a lot about trimming to sensible sentence boundaries.

In my own subjective judgment, supplying a single maximally coherent fragment which prefers clusters of rare terms results in an excerpt which "scans" as quickly as possible, conveying the gist of the content with minimal "visual effort". I used Google's excerpting as a model.

> I think the base litmus test for a highlighter is: if one were to
> take all fragments presented for a document (call this a "fragdoc")
> and make a new document from it, would that document match the
> original query?

Without the aid of formal studies to guide us, this is a subjective call. FWIW, I disagree. In my view, visual scanning speed and coherence are more important than completeness.

I'm not a big fan of the multi-fragment approach, because I think it takes too much effort to grok each individual entry. Furthermore, the fact that the fragments don't start on sentence boundaries (whenever feasible) adds to the visual effort needed to orient yourself.

Search results contain a lot of junk. The user needs to be able to parse the results page as quickly as possible and refine their search query as needed. Noisy excerpts, with lots of ellipses and few sentences that can be "swallowed whole" impede that. Trees vs. Forest.

Again, that's my own aesthetic judgment, but I'll wager that there are studies out there showing that fragments which start at the top of a sentence are easier to consume, and I think that's important.

> In fact, I think the perfect highlighter would "logically" work as
> follows: take a single document and enumerate every single possible
> fragdoc.

KS uses a sliding window rather than chunking up the text into fragdocs of fixed length.

> Having a whole separate package trying to reverse-engineer where matches had
> taken place between Query and Document is hard to get right.

Exactly.

PS: Obviously, refinements of the highlighting algo will help Lucy, too. I don't suppose you want to continue this on the Lucy dev list so that Lucy banks some community credit for this discussion. :\
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682688#action_12682688 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
PS: Obviously, refinements of the highlighting algo will help Lucy, too. I don't suppose you want to continue this on the Lucy dev list so that Lucy banks some community credit for this discussion. :\
{quote}

Well... remember that more discussions between you and me and Nathan on Lucy-dev (as much as I love having them) don't really "count" as a "bigger" community. In other words, like the scoring of a BooleanQuery, there is a very strong coord factor at play when measuring "community". If you and I and Nathan have fewer conversations on Lucy-dev, but then two other new people join in, that is a much stronger community.

So, maybe send a note to lucy-dev, referencing this as a discussion relevant to Lucy's approach to highlighting... and leave a tantalizing invite here for others to make the jump to lucy-dev.

Growing a community is not easy!
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682689#action_12682689 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
> Do you require term vectors to be stored, for highlighting (cannot
> re-analyze the text)?

Yes, but that's not fundamental to the design. You just have to hand the Weight some sort of single-doc index that includes sufficient data to determine what parts of the text contributed to the hit and how much they contributed. The Weight needn't care whether that single-doc index was created on the fly or stored at index time.
{quote}

OK.

{quote}
> For queries that normally do not use positions at all (simple AND/OR
> of terms), how does your highlightSpans() work?

ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the spans produced by their children.
{quote}

Hmm -- it seems like that loses information. I.e., for ANDQuery, you lose the fact that you should try to include a match from each of the sub-clauses' spans.

{quote}
> For BooleanQuery, is coord factor used to favor fragment sets that
> include more unique terms?

No; I don't think that would be fine grained enough to help.
{quote}

What I meant was: all other things being equal, do you more strongly favor a fragment that has all N of the terms in a query vs. another fragment that has fewer than N but, say, a higher net number of occurrences?

{quote}
There's a HeatMap class that performs additional weighting. Spans that cluster together tightly (i.e. that could fit together within the excerpt) are boosted.
{quote}

That sounds great.

{quote}
> Are you guaranteed to always present a net set of fragments that
> "matches" the query? (eg the example query above).

No. The KS version supplies a single fragment. It naturally prefers fragments with rarer terms, because the span scores are multiplied by the Weight's weighting factor (which includes IDF).
{quote}

Hmm OK.

{quote}
Once that fragment is selected, the KS highlighter worries a lot about trimming to sensible sentence boundaries.
{quote}

I totally agree: easy/fast consumability is very important, so choosing entire sentences, or at least anchoring the start or maybe the end on a sentence boundary, is important. I think Lucene's H1 doesn't do this out of the box today (though you could plug in your own fragmenter).

{quote}
In my own subjective judgment, supplying a single maximally coherent fragment which prefers clusters of rare terms results in an excerpt which "scans" as quickly as possible, conveying the gist of the content with minimal "visual effort". I used Google's excerpting as a model.
{quote}

Google picks more than one fragment; it seems like it picks one or two fragments.

I'm torn on whether IDF should really come into play though...

{quote}
> I think the base litmus test for a highlighter is: if one were to
> take all fragments presented for a document (call this a "fragdoc")
> and make a new document from it, would that document match the
> original query?

Without the aid of formal studies to guide us, this is a subjective call. FWIW, I disagree. In my view, visual scanning speed and coherence are more important than completeness. I'm not a big fan of the multi-fragment approach, because I think it takes too much effort to grok each individual entry. Furthermore, the fact that the fragments don't start on sentence boundaries (whenever feasible) adds to the visual effort needed to orient yourself.

Search results contain a lot of junk. The user needs to be able to parse the results page as quickly as possible and refine their search query as needed. Noisy excerpts, with lots of ellipses and few sentences that can be "swallowed whole", impede that. Trees vs. Forest.

Again, that's my own aesthetic judgment, but I'll wager that there are studies out there showing that fragments which start at the top of a sentence are easier to consume, and I think that's important.
{quote}

I agree, it's not cut and dried here; this is all quite subjective.

I think one case that's tricky is two terms that do not tend to co-occur in proximity. E.g., search for python greenspan on Google, and most of the fragdocs consist of two fragments, one for each term. I.e., Google is trying to include all the terms in the fragdoc (my "coord factor" question above).

{quote}
> In fact, I think the perfect highlighter would "logically" work as
> follows: take a single document and enumerate every single possible
> fragdoc.

KS uses a sliding window rather than chunking up the text into fragdocs of fixed length.
{quote}

Or, the allowed length of each fragment could span a specified min/max range. And I like the sliding window approach instead of the pre-fragment approach.

(Note: a fragdoc is one or more fragments stuck together, i.e., the entire excerpt.)
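To make the "coord factor" question in the comment above concrete, here is a minimal Java sketch. It assumes the simplest possible case, a pure conjunction of terms, and treats each fragment as a set of already-analyzed terms. The names `FragdocCoverageSketch`, `coord`, and `coversAllTerms` are hypothetical; they are not part of Lucene, KinoSearch, or the attached patch.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not code from Lucene or KinoSearch. */
final class FragdocCoverageSketch {

    /**
     * Fraction of distinct query terms that appear somewhere in the fragdoc,
     * i.e. in the union of all fragments. Returns 0.0 for an empty query.
     */
    static double coord(Set<String> queryTerms, List<Set<String>> fragments) {
        if (queryTerms.isEmpty()) {
            return 0.0;
        }
        return (double) matchedTerms(queryTerms, fragments).size() / queryTerms.size();
    }

    /**
     * The litmus test for a pure AND query (an assumption of this sketch):
     * would a document made from the fragments contain every query term?
     */
    static boolean coversAllTerms(Set<String> queryTerms, List<Set<String>> fragments) {
        return matchedTerms(queryTerms, fragments).size() == queryTerms.size();
    }

    /** Collects the query terms that occur in at least one fragment. */
    private static Set<String> matchedTerms(Set<String> queryTerms,
                                            List<Set<String>> fragments) {
        Set<String> matched = new HashSet<>();
        for (Set<String> fragment : fragments) {
            for (String term : fragment) {
                if (queryTerms.contains(term)) {
                    matched.add(term);
                }
            }
        }
        return matched;
    }
}
```

With the query terms python and greenspan, a fragdoc built from one fragment per term gets a coord of 1.0, while a single fragment that mentions only python gets 0.5. Real queries with phrases, optional clauses, or NOT clauses would need more than this set arithmetic.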
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682696#action_12682696 ]

Mark Miller commented on LUCENE-1522:
-------------------------------------

{quote}But, flattening could produce just a and d (I think?); and I think H1 could do the same even with SpanScorer (Mark is that true? I don't fully understand the Query -> SpanQuery conversion). {quote}

Right - SpanScorer won't follow boolean logic - it will just break down each clause and not highlight a NOT - similar to standard H1. If a particular clause is position sensitive, it will only be highlighted if it's found in a valid position, but that's as deep as it goes.
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682762#action_12682762 ]

Michael Busch commented on LUCENE-1522:
---------------------------------------

I wrote the highlighter for the OmniFind Yahoo Edition a few years ago and I totally agree that all this stuff is very subjective.

The OYE highlighter is of course based on Lucene and uses a sliding window too. It also uses information about sentence boundaries and prefers fragments that start at the beginning of a sentence. So it goes through the document and generates fragment candidates on the fly. It calculates a score for each fragment and puts it into a priority queue. The score is calculated using different heuristics:

- fragments that start at the beginning of a sentence are boosted
- the more highlighted terms a fragment contains, the higher it is scored
- more different highlighted terms score higher than many occurrences of the same term
- no tf-idf is used
- if a fragment does not start at the beginning of a sentence, then it is scored higher if the highlighted term(s) occur(s) more in the middle of the fragment: e.g. 'a b c d e' scores lower than 'b c a d e' if 'a' is the highlighted term; this is being done to show as much context as possible around a highlighted term
- only a single long fragment is created if it contains all query terms (like Google)
- The queue tries to gather fragments, so that the union of the fragments contains as many different query terms as possible. So it might toss a fragment in favor of one with a higher score, if it increases the total number of different highlighted terms.
- For performance reasons there is an early termination if the fragments in the queue contain all query terms.

Initially this highlighter also imitated Lucene's behavior to find the highlighted positions. Last year I changed it to use SpanQueries. With our flexible query parser (which I introduced on java-dev recently) we have two different QueryBuilders. One creates the "normal" query that is executed to find the matching docs. Then a different QueryBuilder creates SpanQueries from the same query for the highlighter.

The output of the highlighter is not formatted HTML, but rather an object containing the unformatted text, together with offset information for both fragments and highlights. These offset spans can carry additional information, which can be used for multi-color highlighting too. We then use an HTMLFormatter class to generate the formatted text. An XMLFormatter that keeps the offset information separate from the actual text is also possible (we're currently working on such an XMLFormatter). This is useful for frontends written in e.g. Flex.

The performance of our highlighter is good and so far we have been pretty happy with the quality of the excerpts, but there is still much room for improvement.

I'd be happy to help work on a new highlighter. I think this is a very important component, and Lucene's core should have a very good and flexible one.
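A few of the heuristics listed in the comment above can be sketched in Java as follows. This is only an illustration under stated assumptions: the weights (2.0 per distinct term, 0.5 per repeated occurrence, 1.0 for a sentence start) are invented for the example, the names `FragmentScoringSketch`, `Candidate`, and `topN` are hypothetical, and the OYE highlighter's actual scoring, its mid-fragment positioning rule, and its term-coverage logic are not shown in the thread and are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration only; weights and names are assumptions. */
final class FragmentScoringSketch {

    /** A fragment candidate, reduced to the data the heuristics need. */
    static final class Candidate {
        final List<String> hitTerms;   // highlighted terms, one entry per occurrence
        final boolean startsSentence;  // true if the fragment begins a sentence

        Candidate(List<String> hitTerms, boolean startsSentence) {
            this.hitTerms = hitTerms;
            this.startsSentence = startsSentence;
        }

        /** No tf-idf: only counts of distinct and repeated highlighted terms. */
        double score() {
            int distinct = new HashSet<>(hitTerms).size();
            int repeats = hitTerms.size() - distinct;
            double score = 2.0 * distinct + 0.5 * repeats; // distinct terms count more
            if (startsSentence) {
                score += 1.0; // boost fragments that start a sentence
            }
            return score;
        }
    }

    /** Keeps the n best candidates using a bounded min-heap. */
    static List<Candidate> topN(List<Candidate> candidates, int n) {
        Comparator<Candidate> byScore = Comparator.comparingDouble(Candidate::score);
        PriorityQueue<Candidate> queue = new PriorityQueue<>(byScore);
        for (Candidate candidate : candidates) {
            queue.add(candidate);
            if (queue.size() > n) {
                queue.poll(); // drop the current lowest-scoring candidate
            }
        }
        List<Candidate> best = new ArrayList<>(queue);
        best.sort(byScore.reversed()); // highest score first
        return best;
    }
}
```

Under these assumed weights, a fragment with one occurrence each of two different terms (score 4.0) outranks a fragment with three occurrences of a single term (score 3.0), which matches the stated preference for term diversity over repetition.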
|
|
[jira] Issue Comment Edited: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682762#action_12682762 ]

Michael Busch edited comment on LUCENE-1522 at 3/17/09 12:10 PM:
-----------------------------------------------------------------

I wrote the highlighter for the OmniFind Yahoo Edition a few years ago and I totally agree that all this stuff is very subjective.

The OYE highlighter is of course based on Lucene and uses a sliding window too. It also uses information about sentence boundaries and prefers fragments that start at the beginning of a sentence. So it goes through the document and generates fragment candidates on the fly. It calculates a score for each fragment and puts it into a priority queue. The score is calculated using different heuristics:

- fragments that start at the beginning of a sentence are boosted
- the more highlighted terms a fragment contains, the higher it is scored
- more different highlighted terms score higher than many occurrences of the same term
- no tf-idf is used
- if a fragment does not start at the beginning of a sentence, then it is scored higher if the highlighted term(s) occur(s) more in the middle of the fragment: e.g. 'a b c d e' scores lower than 'b c a d e' if 'a' is the highlighted term; this is being done to show as much context as possible around a highlighted term
- only a single long fragment is created if it contains all query terms (like Google)
- The queue tries to gather fragments, so that the union of the fragments contains as many different query terms as possible. So it might toss a fragment in favor of one with a lower score, if it increases the total number of different highlighted terms.
- For performance reasons there is an early termination if the fragments in the queue contain all query terms.

Initially this highlighter also imitated Lucene's behavior to find the highlighted positions. Last year I changed it to use SpanQueries. With our flexible query parser (which I introduced on java-dev recently) we have two different QueryBuilders. One creates the "normal" query that is executed to find the matching docs. Then a different QueryBuilder creates SpanQueries from the same query for the highlighter.

The output of the highlighter is not formatted HTML, but rather an object containing the unformatted text, together with offset information for both fragments and highlights. These offset spans can carry additional information, which can be used for multi-color highlighting too. We then use an HTMLFormatter class to generate the formatted text. An XMLFormatter that keeps the offset information separate from the actual text is also possible (we're currently working on such an XMLFormatter). This is useful for frontends written in e.g. Flex.

The performance of our highlighter is good and so far we have been pretty happy with the quality of the excerpts, but there is still much room for improvement.

I'd be happy to help work on a new highlighter. I think this is a very important component, and Lucene's core should have a very good and flexible one.
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682777#action_12682777 ]

Marvin Humphrey commented on LUCENE-1522:
-----------------------------------------

>> ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
>> spans produced by their children.
>
> Hmm - it seems like that loses information. Ie, for ANDQuery, you lose the
> fact that you should try to include a match from each of the sub-clauses' spans.

A good idea. ANDQuery's highlightSpans() method could probably be improved by post-processing the child spans to take this into account. That way we wouldn't have to gum up the main Highlighter code with a bunch of conditionals which afford special treatment to certain query types.

> What I meant was: all other things being equal, do you more strongly
> favor a fragment that has all N of the terms in a query vs another
> fragment that has fewer than N but say higher net number of occurrences.

No, the diversity of the terms in a fragment isn't factored in. The span objects only tell the Highlighter that a particular range of characters was important; they don't say why.

However, note that IDF would prevent a bunch of hits on "the" from causing too hot a hotspot in the heat map. So you're likely to see fragments with high discriminatory value.

> Google picks more than one fragment; it seems like it picks one or two
> fragments.

I probably overstated my opposition to supplying an excerpt containing more than one fragment. It seems OK to me to select more than one, so long as they all scan easily, and so long as the excerpts don't get long enough to force excessive scrolling and slow down the time it takes the user to scan the whole results page.

What bothers me is that the excerpts don't scan easily right now. I consider that a much more important defect than the fact that the fragdoc doesn't hit every term (which isn't even possible for large queries), and it seemed to me that pursuing exhaustive term matching was likely to yield even more highly fragmented, visually chaotic fragdocs.
|
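To make the ANDQuery post-processing idea a bit more concrete, here is a rough sketch. It is not code from the patch or from KinoSearch: the `Span` class, `clausesCovered()`, and the idea of counting covered sub-clauses per candidate window are all assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class AndSpanCoverageSketch {

    /** Hypothetical span: character offsets [start, end) plus a weight. Not a Lucene class. */
    static final class Span {
        final int start;
        final int end;
        final float weight;

        Span(int start, int end, float weight) {
            this.start = start;
            this.end = end;
            this.weight = weight;
        }
    }

    /**
     * Counts how many child clauses have at least one span lying entirely
     * inside the candidate fragment [from, to). An AND-style query could use
     * this to prefer fragments that show a match from every sub-clause.
     */
    static int clausesCovered(List<List<Span>> childSpans, int from, int to) {
        int covered = 0;
        for (List<Span> spans : childSpans) {
            for (Span s : spans) {
                if (s.start >= from && s.end <= to) {
                    covered++;
                    break; // one match per clause is enough
                }
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        List<List<Span>> children = Arrays.asList(
                Arrays.asList(new Span(0, 5, 1f), new Span(50, 55, 1f)), // clause A
                Arrays.asList(new Span(10, 14, 1f)));                    // clause B
        System.out.println(clausesCovered(children, 0, 20));
        System.out.println(clausesCovered(children, 40, 60));
    }
}
```

With a helper like this, the union of child spans stays the raw input, and the preference for "one match from each sub-clause" lives in the query's own post-processing rather than in the main Highlighter.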
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682985#action_12682985 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
>> ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
>> spans produced by their children.
>
> Hmm - it seems like that loses information. Ie, for ANDQuery, you lose the
> fact that you should try to include a match from each of the sub-clauses' spans.

A good idea. ANDQuery's highlightSpans() method could probably be improved by post-processing the child spans to take this into account. That way we wouldn't have to gum up the main Highlighter code with a bunch of conditionals which afford special treatment to certain query types.
{quote}

I think we may need a tree-structured result returned by the Weight/Scorer, compactly representing the "space" of valid fragdocs for this one doc. And then somehow we walk that tree, enumerating/scoring individual "valid" fragdocs that are created from that tree.

{quote}
> What I meant was: all other things being equal, do you more strongly
> favor a fragment that has all N of the terms in a query vs another
> fragment that has fewer than N but say higher net number of occurrences.

No, the diversity of the terms in a fragment isn't factored in. The span objects only tell the Highlighter that a particular range of characters was important; they don't say why.

However, note that IDF would prevent a bunch of hits on "the" from causing too hot a hotspot in the heat map. So you're likely to see fragments with high discriminatory value.
{quote}

This still seems subjectively wrong to me. If I search for "president bush", probably bush is the rarer term and so you would favor showing me a single fragment that had bush occur twice, over a fragment that had a single occurrence of president and bush?

{quote}
> Google picks more than one fragment; it seems like it picks one or two
> fragments.

I probably overstated my opposition to supplying an excerpt containing more than one fragment. It seems OK to me to select more than one, so long as they all scan easily, and so long as the excerpts don't get long enough to force excessive scrolling and slow down the time it takes the user to scan the whole results page.

What bothers me is that the excerpts don't scan easily right now. I consider that a much more important defect than the fact that the fragdoc doesn't hit every term (which isn't even possible for large queries), and it seemed to me that pursuing exhaustive term matching was likely to yield even more highly fragmented, visually chaotic fragdocs.
{quote}

Which excerpts don't scan easily right now? Google's, KS's, Lucene's H1 or H2?

I think with a tree structure representing the search space for all fragdocs, we could then efficiently enumerate fragdocs with an appropriate scoring model (favoring sentence starts or surrounding context, breadth of terms, etc.). This way we can do a real search (on all fragdocs) subject to the preference for consumability/breadth.
|
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682987#action_12682987 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

OK, to sum up here, with observations / wish list / ideas / controversies / etc. for Lucene's future merged highlighter:

* Fragmenter should aim for fast "eye + brain scanning consumability" (eg, try hard to start on sentence boundaries, include context)
* Let's try for single source -- each Query/Weight/Scorer should be able to enumerate the set of term positions/spans that caused it to match a specific doc (like explain(), but provides positions/spans detailing the match). Trying to "reverse engineer" the matching is brittle
* Sliding window is better than static "top down" fragmentation
* To scale, we should make a simple IndexReader impl on top of term vectors, but still allow the "re-index single doc on the fly" option
* Favoring breadth (more unique terms instead of many occurrences of certain terms) seems important, except for too-many-term queries where this gets unwieldy
* Prefer a single fragment if it scores well enough, but fall back to several, if necessary, to show "breadth"
* Produce structured output so non-HTML front ends (eg Flex) can render
* Try to include "context around the hits", when possible (eg the "favor middle of the sentence" that Michael described)
* Maybe let IDF affect fragment scoring, maybe not
* Performance is important -- use TermVectors if present, add early termination if you've already found a good enough fragdoc, etc.
* Maybe a tree-based fragdoc enumeration / searching model; I think this'd be even more efficient than sliding window, especially for large docs
* Multi-color, HeatMap default out-of-the-box HTML UIs are nice
* It's all very subjective and quite a good challenge!!

In the meantime, it seems like we should commit this H2 and give users the choice? We can then iterate over time on our wish list....
|
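For anyone who wants to experiment with the "sliding window" and "favoring breadth" items from the list above, a minimal sketch follows. Everything in it is assumed for illustration (the `Hit` class, `bestWindowStart()`, and the scoring rule of "most distinct terms, ties broken by total hits"); it is not how H1, H2, or KinoSearch choose fragments.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SlidingWindowSketch {

    /** Hypothetical hit record: which query term matched, and at which character offsets. */
    static final class Hit {
        final String term;
        final int start; // inclusive
        final int end;   // exclusive

        Hit(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Slides a window of windowChars characters over the hits (which must be
     * sorted by start offset), anchoring it at each hit in turn, and returns
     * the start offset of the window with the most distinct terms. Ties are
     * broken by the total number of hits. Returns -1 if no hit fits.
     * This simple version is O(n^2) in the number of hits.
     */
    static int bestWindowStart(List<Hit> hitsByStart, int windowChars) {
        int bestStart = -1;
        int bestDistinct = 0;
        int bestTotal = 0;
        for (int i = 0; i < hitsByStart.size(); i++) {
            int from = hitsByStart.get(i).start;
            int to = from + windowChars;
            Set<String> distinct = new HashSet<>();
            int total = 0;
            for (int j = i; j < hitsByStart.size() && hitsByStart.get(j).start < to; j++) {
                Hit h = hitsByStart.get(j);
                if (h.end <= to) { // only count hits that fit completely
                    distinct.add(h.term);
                    total++;
                }
            }
            boolean better = distinct.size() > bestDistinct
                    || (distinct.size() == bestDistinct && total > bestTotal);
            if (better) {
                bestStart = from;
                bestDistinct = distinct.size();
                bestTotal = total;
            }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("bush", 0, 4),
                new Hit("bush", 10, 14),
                new Hit("president", 200, 209),
                new Hit("bush", 210, 214));
        System.out.println(bestWindowStart(hits, 100));
    }
}
```

In the sample data, the window starting at offset 200 wins because it contains both "president" and "bush", even though the window at offset 0 has the same number of hits. A real fragmenter would still need to snap the window to sentence boundaries and add surrounding context.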
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683030#action_12683030 ]

Marvin Humphrey commented on LUCENE-1522:
-----------------------------------------

> I think we may need a tree-structured result returned by the
> Weight/Scorer, compactly representing the "space" of valid fragdocs
> for this one doc. And then somehow we walk that tree,
> enumerating/scoring individual "valid" fragdocs that are created from
> that tree.

Something like that. An array of span scores is too limited; a full-fledged class would do better. Designing that class requires striking a balance between what information we think is useful and what information Highlighter can sanely reduce.

By proposing the tree structure, you're suggesting that Highlighter will reverse engineer boolean matching; that sounds like a lot of work to me.

>> However, note that IDF would prevent a bunch of hits on "the" from causing too
>> hot a hotspot in the heat map. So you're likely to see fragments with high
>> discriminatory value.
>
> This still seems subjectively wrong to me. If I search for "president
> bush", probably bush is the rarer term and so you would favor showing
> me a single fragment that had bush occur twice, over a fragment that
> had a single occurrence of president and bush?

We've ended up in a false dichotomy. Favoring high IDF terms -- or more accurately, high scoring character position spans -- and favoring fragments with high term diversity are not mutually exclusive.

Still, the KS highlighter probably wouldn't do what you describe. The proximity boosting accelerates as the spans approach each other, and maxes out if they're adjacent. So "bush bush" might be preferred over "president bush", but "bush or bush" probably wouldn't.

I don't think that there's anything wrong with preferring high term diversity; the KS highlighter doesn't happen to support favoring fragments with high term diversity now, but would be improved by adding that capability. I just don't think term diversity is so important that it qualifies as a "base litmus test".

There are other ways of choosing good fragments, and IDF is one of them. If you want to show why a doc matched a query, it makes sense to show the section of the document that contributed most to the score, surrounded by a little context.

> Which excerpts don't scan easily right now? Google's, KS's, Lucene's
> H1 or H2?

Lucene H1. Too many ellipses, and fragments don't prefer to start on sentence boundaries.

I have to qualify the assertion that the fragments don't scan well with the caveat that I'm basing this on a personal impression. However, I'm pretty confident about that impression. I would be stunned if there were not studies out there demonstrating that sentence fragments which begin at the top are easier to consume than sentence fragments which begin in the middle.
|
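The actual KinoSearch proximity formula isn't given in this thread, so the following is only a guess at the general shape described above: a boost that grows faster and faster as two spans get closer and peaks when they are adjacent. The quadratic curve, the `maxGapChars` cutoff, and the method name are all assumptions.

```java
final class ProximityBoostSketch {

    /**
     * Returns a boost in [0, 1] for two spans separated by gapChars characters.
     * The boost is 0 once the gap reaches maxGapChars, rises quadratically as
     * the gap shrinks, and reaches its maximum of 1.0 when the spans are
     * adjacent (gap of 0). Overlapping spans (negative gap) are treated as adjacent.
     */
    static double proximityBoost(int gapChars, int maxGapChars) {
        if (maxGapChars <= 0) {
            throw new IllegalArgumentException("maxGapChars must be positive");
        }
        int gap = Math.max(0, gapChars);
        if (gap >= maxGapChars) {
            return 0.0;
        }
        double closeness = 1.0 - (double) gap / maxGapChars;
        return closeness * closeness;
    }

    public static void main(String[] args) {
        // Print the boost for a few gaps to see the curve steepen near zero.
        for (int gap : new int[] {0, 10, 50, 90, 100}) {
            System.out.println(gap + " -> " + proximityBoost(gap, 100));
        }
    }
}
```

Whether a curve like this reproduces the "bush bush" versus "bush or bush" behaviour depends on how the boost is combined with the per-span weights, which this sketch deliberately leaves out.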
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683032#action_12683032 ]

Mark Miller commented on LUCENE-1522:
-------------------------------------

bq. Lucene H1. Too many ellipses, and fragments don't prefer to start on sentence boundaries.

That's not necessarily a property of the Highlighter, just the basic implementations we currently supply for the pluggable classes. You can supply a custom fragmenter and you can control the number of fragments.
|
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683053#action_12683053 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
Something like that. An array of span scores is too limited; a full-fledged class would do better. Designing that class requires striking a balance between what information we think is useful and what information Highlighter can sanely reduce.
{quote}

Agreed, and I'm not sure about the tree structure (just floating ideas...). It could very well be overkill.

{quote}
By proposing the tree structure, you're suggesting that Highlighter will reverse engineer boolean matching; that sounds like a lot of work to me.
{quote}

It wouldn't be reverse engineered: BooleanQuery/Weight/Scorer2 itself will have returned that. Ie, we would add a method such as "getSpanTree()".

{quote}
Still, the KS highlighter probably wouldn't do what you describe. The proximity boosting accelerates as the spans approach each other, and maxes out if they're adjacent. So "bush bush" might be preferred over "president bush", but "bush or bush" probably wouldn't.
{quote}

OK, it sounds like one can simply use different models to score fragdocs and it's still an open debate how much each of these criteria (IDF, showing surrounding context, being on sentence boundary, diversity of terms) should impact the score. I agree, the "basic litmus test" I proposed is too strong.

{quote}
bq. Lucene H1. Too many ellipses, and fragments don't prefer to start on sentence boundaries.

That's not necessarily a property of the Highlighter, just the basic implementations we currently supply for the pluggable classes. You can supply a custom fragmenter and you can control the number of fragments.
{quote}

I agree: H1 is very pluggable and one could plug in a better fragmenter, but we don't offer such an impl in H1, and this is a case where "out-of-the-box defaults" are very important.
|
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683064#action_12683064 ]

Marvin Humphrey commented on LUCENE-1522:
-----------------------------------------

> OK, it sounds like one can simply use different models to score
> fragdocs and it's still an open debate how much each of these criteria
> (IDF, showing surrounding context, being on sentence boundary, diversity
> of terms) should impact the score.

With Michael Busch's priority queue approach, the algorithm for choosing the fragments can be abstracted into the class of object we put in the queue and its lessThan() method. The output from the queue just has to be something the Highlighter can chew.
|
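A small sketch of how the priority queue idea could look. It uses `java.util.PriorityQueue` with a `Comparator` standing in for a lessThan() method, rather than Lucene's own PriorityQueue class, and the `Fragment` fields and the `BREADTH_THEN_SCORE` ordering are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class FragmentQueueSketch {

    /** Hypothetical candidate fragment; the fields are assumptions, not an existing API. */
    static final class Fragment {
        final int start;
        final int distinctTerms;
        final float score;

        Fragment(int start, int distinctTerms, float score) {
            this.start = start;
            this.distinctTerms = distinctTerms;
            this.score = score;
        }
    }

    /** Plays the role of lessThan(): fewer distinct terms is "less", ties go to the lower score. */
    static final Comparator<Fragment> BREADTH_THEN_SCORE =
            Comparator.<Fragment>comparingInt(f -> f.distinctTerms)
                    .thenComparingDouble(f -> f.score);

    /** Keeps the n best candidates under the given ordering, returned best first. */
    static List<Fragment> topN(List<Fragment> candidates, int n, Comparator<Fragment> order) {
        // Min-heap: the "least" fragment sits on top and is evicted first.
        PriorityQueue<Fragment> queue = new PriorityQueue<>(Math.max(1, n), order);
        for (Fragment f : candidates) {
            queue.offer(f);
            if (queue.size() > n) {
                queue.poll(); // drop the current worst
            }
        }
        List<Fragment> best = new ArrayList<>(queue);
        best.sort(order.reversed());
        return best;
    }

    public static void main(String[] args) {
        List<Fragment> candidates = Arrays.asList(
                new Fragment(0, 1, 5f),
                new Fragment(100, 2, 1f),
                new Fragment(200, 2, 3f));
        for (Fragment f : topN(candidates, 2, BREADTH_THEN_SCORE)) {
            System.out.println(f.start + " terms=" + f.distinctTerms + " score=" + f.score);
        }
    }
}
```

Swapping in a different `Comparator` (for example, score only, or a bonus for sentence starts) changes the selection policy without touching the queue or the code that consumes its output.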