[jira] Created: (LUCENE-1522) another highlighter

another highlighter
-------------------

                 Key: LUCENE-1522
                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/highlighter
            Reporter: Koji Sekiguchi
            Priority: Minor

I've written this highlighter for my project to support a bi-gram token stream. The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
// docId=0, fieldName="content", fragCharSize=100, numFragments=3
String[] fragments = h.getBestFragments( fq, reader, 0, "content", 100, 3 );
{code}

features:
- fast for large docs
- supports "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---------------
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply patch due to independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't look at idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...
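The phrase-unit highlighting example in the feature list (q="w1 w2" and q="w1 w2"~1) can be illustrated with a small standalone sketch. It is not the code from the patch and does not use Lucene: the class PhraseHighlightSketch, its Token type, and the simplified notion of slop (number of tokens allowed between the two terms, in order) are all assumptions made for illustration. It only shows how token positions and character offsets are enough to decide whether two terms are tagged as one unit or separately.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from the patch.
final class PhraseHighlightSketch {

    /** A token plus its character offsets, similar to what a term vector with offsets stores. */
    static final class Token {
        final String term;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset
        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on spaces; the index of a token in the list serves as its position. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (text.charAt(i) == ' ') { i++; continue; }
            int j = i;
            while (j < text.length() && text.charAt(j) != ' ') j++;
            tokens.add(new Token(text.substring(i, j), i, j));
            i = j;
        }
        return tokens;
    }

    /** Highlights the two-term phrase "first second", allowing up to slop tokens in between. */
    static String highlight(String text, String first, String second, int slop) {
        List<Token> tokens = tokenize(text);
        List<int[]> spans = new ArrayList<>(); // [start, end) ranges, in document order
        for (int p = 0; p < tokens.size(); p++) {
            if (!tokens.get(p).term.equals(first)) continue;
            int limit = Math.min(tokens.size() - 1, p + 1 + slop);
            for (int q = p + 1; q <= limit; q++) {
                if (!tokens.get(q).term.equals(second)) continue;
                if (q == p + 1) {
                    // Adjacent terms: one tag pair around the whole phrase.
                    spans.add(new int[] {tokens.get(p).start, tokens.get(q).end});
                } else {
                    // Terms separated by other tokens: tag each term on its own.
                    spans.add(new int[] {tokens.get(p).start, tokens.get(p).end});
                    spans.add(new int[] {tokens.get(q).start, tokens.get(q).end});
                }
                p = q; // resume scanning after this match
                break;
            }
        }
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] s : spans) {
            out.append(text, last, s[0]).append("<b>").append(text, s[0], s[1]).append("</b>");
            last = s[1];
        }
        return out.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("w1 w2", "w1", "w2", 0));
        System.out.println(highlight("w1 w3 w2 w3 w1 w2", "w1", "w2", 1));
    }
}
```

With slop 1, the text "w1 w3 w2 w3 w1 w2" comes out as in the {noformat} example above: the separated pair is tagged term by term and the adjacent pair is tagged as one unit.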
|
|
[jira] Updated: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1522:
-----------------------------------

    Attachment: LUCENE-1522.patch

To apply this patch, LUCENE-1448 also needs to be applied.

{code}
$ svn co -r713975 http://svn.apache.org/repos/asf/lucene/java/trunk
$ cd trunk
$ patch -p0 < LUCENE-1448.patch
$ patch -p0 < LUCENE-1522.patch
{code}
|
|
[jira] Updated: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1522:
-----------------------------------

    Description:
I've written this highlighter for my project to support a bi-gram token stream. The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
TopDocs docs = searcher.search( query, 10 );
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
for( ScoreDoc scoreDoc : docs.scoreDocs ){
  // fieldName="content", fragCharSize=100, numFragments=3
  String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
  if( fragments != null ){
    for( String fragment : fragments )
      System.out.println( fragment );
  }
}
{code}

features:
- fast for large docs
- supports "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---------------
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply patch due to independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't look at idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
|
|
[jira] Updated: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1522:
-----------------------------------

    Description:
I've written this highlighter for my project to support a bi-gram token stream (general token streams, e.g. WhitespaceTokenizer, are also supported; see the test code in the patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
TopDocs docs = searcher.search( query, 10 );
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
for( ScoreDoc scoreDoc : docs.scoreDocs ){
  // fieldName="content", fragCharSize=100, numFragments=3
  String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
  if( fragments != null ){
    for( String fragment : fragments )
      System.out.println( fragment );
  }
}
{code}

features:
- fast for large docs
- supports not only whitespace-based token streams, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---------------
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply patch due to independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't look at idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
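The "fixed size" N-gram point, (2,2) rather than (1,3), can be made concrete with a small standalone sketch. This is not taken from the patch and uses no Lucene classes; BigramOffsetSketch and its methods are hypothetical names. The assumption it illustrates is that when every gram has the same length n, the gram at position i covers characters [i, i + n), so a run of consecutive matching grams can be merged back into a single highlighted character range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from the patch.
final class BigramOffsetSketch {

    /** Fixed-size n-grams: the gram at list index i covers characters [i, i + n). */
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /**
     * Finds the first place where the query's grams occur at consecutive
     * positions in the text. Returns the merged [start, end) character range,
     * or null when there is no such run.
     */
    static int[] findMatch(String text, String query, int n) {
        List<String> docGrams = ngrams(text, n);
        List<String> queryGrams = ngrams(query, n);
        if (queryGrams.isEmpty()) return null;
        for (int pos = 0; pos + queryGrams.size() <= docGrams.size(); pos++) {
            if (docGrams.subList(pos, pos + queryGrams.size()).equals(queryGrams)) {
                // The first gram starts at pos; the last gram ends n characters
                // after its own start at pos + queryGrams.size() - 1.
                return new int[] {pos, pos + queryGrams.size() - 1 + n};
            }
        }
        return null;
    }

    /** Wraps the merged range in <b> tags, or returns the text unchanged. */
    static String highlight(String text, String query, int n) {
        int[] m = findMatch(text, query, n);
        if (m == null) return text;
        return text.substring(0, m[0]) + "<b>" + text.substring(m[0], m[1]) + "</b>"
                + text.substring(m[1]);
    }

    public static void main(String[] args) {
        System.out.println(ngrams("abcd", 2));
        System.out.println(highlight("xabcdy", "bcd", 2));
    }
}
```

With variable gram sizes such as (1,3), several grams of different lengths start at the same character, so this simple position-to-offset arithmetic no longer holds; that is one plausible reading of why the description restricts itself to fixed sizes.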
|
|
[jira] Updated: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1522:
-----------------------------------

    Attachment: LUCENE-1522.patch

The attached patch has "colored tag highlighting" feature. :)

I provided the following colored tags:

{code:title=BaseFragmentsBuilder.java}
public static final String[] COLORED_PRE_TAGS = {
  "<b style=\"background:yellow\">",
  "<b style=\"background:lawngreen\">",
  "<b style=\"background:aquamarine\">",
  "<b style=\"background:magenta\">",
  "<b style=\"background:palegreen\">",
  "<b style=\"background:coral\">",
  "<b style=\"background:wheat\">",
  "<b style=\"background:khaki\">",
  "<b style=\"background:lime\">",
  "<b style=\"background:deepskyblue\">"
};
{code}

A sample picture will be attached shortly.
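The message lists the colored pre-tags but does not say how a tag is chosen for a given term. One plausible scheme, shown here purely as an assumption, is to number the distinct query terms and cycle through the array. ColoredTagSketch, preTag, wrap and the POST_TAG constant are hypothetical names for illustration; only the COLORED_PRE_TAGS values are copied from the message.

```java
// Hypothetical illustration only; the tag-selection rule is an assumption,
// not something stated in the patch description.
final class ColoredTagSketch {

    // Values copied from the message above.
    static final String[] COLORED_PRE_TAGS = {
        "<b style=\"background:yellow\">",
        "<b style=\"background:lawngreen\">",
        "<b style=\"background:aquamarine\">",
        "<b style=\"background:magenta\">",
        "<b style=\"background:palegreen\">",
        "<b style=\"background:coral\">",
        "<b style=\"background:wheat\">",
        "<b style=\"background:khaki\">",
        "<b style=\"background:lime\">",
        "<b style=\"background:deepskyblue\">"
    };

    // Assumed closing tag matching the <b ...> openers.
    static final String POST_TAG = "</b>";

    /** Picks a pre-tag for the n-th distinct query term, wrapping around after the last color. */
    static String preTag(int termIndex) {
        return COLORED_PRE_TAGS[Math.floorMod(termIndex, COLORED_PRE_TAGS.length)];
    }

    /** Wraps one matched term in the tag pair chosen for its index. */
    static String wrap(String matchedText, int termIndex) {
        return preTag(termIndex) + matchedText + POST_TAG;
    }

    public static void main(String[] args) {
        System.out.println(wrap("lucene", 0));
        System.out.println(wrap("highlighter", 1));
    }
}
```

Cycling with a modulo keeps the sketch well defined when a query has more than ten distinct terms; colors then repeat.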
|
|
[jira] Updated: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1522:
-----------------------------------

    Attachment: colored-tag-sample.png
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681249#action_12681249 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

This highlighter looks very interesting! I love the colored tags, and the fast performance on large docs, and the extensive unit tests.

When I applied the patch to current trunk, I see some tests failing, eg:

{code}
[junit] Testcase: test1PhraseLongMVB(org.apache.lucene.search.highlight2.FieldPhraseListTest): FAILED
[junit] expected:<sppeeeed(1.0)((8[8,93]))> but was:<sppeeeed(1.0)((8[7,92]))>
[junit] junit.framework.ComparisonFailure: expected:<sppeeeed(1.0)((8[8,93]))> but was:<sppeeeed(1.0)((8[7,92]))>
[junit] at org.apache.lucene.search.highlight2.FieldPhraseListTest.test1PhraseLongMVB(FieldPhraseListTest.java:175)
{code}

Is this approach guaranteed to only highlight term occurrences that actually contribute to the document match? Can it handle all / arbitrary Query subclasses? How does it score fragments?

I also like that you first generate hits in the document, and from those hits you generate fragments (if I'm reading the code correctly); this is a nicely scalable approach.
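The "hits first, fragments second" flow described in the comment above can be sketched without Lucene. The following is not the patch's FragListBuilder; FragmentSketch, Hit, Fragment and bestFragments are hypothetical names, and the greedy windowing and the "sum of boosts" score are assumptions chosen to match the issue description (fragCharSize, numFragments, scoring by query boost without idf).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from the patch.
final class FragmentSketch {

    /** One matched term or phrase: character range plus the boost of the query that matched. */
    static final class Hit {
        final int start;
        final int end;
        final float boost;
        Hit(int start, int end, float boost) {
            this.start = start;
            this.end = end;
            this.boost = boost;
        }
    }

    /** A candidate snippet: character range plus the summed boost of the hits inside it. */
    static final class Fragment {
        final int start;
        final int end;
        final float score;
        Fragment(int start, int end, float score) {
            this.start = start;
            this.end = end;
            this.score = score;
        }
    }

    /**
     * Greedy windowing over hits sorted by start offset: open a window of
     * fragCharSize characters at the next uncovered hit, absorb every following
     * hit that ends inside the window, then keep the numFragments best windows.
     */
    static List<Fragment> bestFragments(List<Hit> hits, int textLength,
                                        int fragCharSize, int numFragments) {
        List<Fragment> frags = new ArrayList<>();
        int i = 0;
        while (i < hits.size()) {
            int start = hits.get(i).start;
            int end = Math.min(start + fragCharSize, textLength);
            int firstHit = i;
            float score = 0f;
            while (i < hits.size() && hits.get(i).end <= end) {
                score += hits.get(i).boost;
                i++;
            }
            if (i == firstHit) {
                i++; // this hit alone does not fit in a window; skip it
                continue;
            }
            frags.add(new Fragment(start, end, score));
        }
        // Highest score first; the sort is stable, so ties keep document order.
        frags.sort((a, b) -> Float.compare(b.score, a.score));
        int keep = Math.max(0, Math.min(numFragments, frags.size()));
        return new ArrayList<>(frags.subList(0, keep));
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(0, 2, 1f));
        hits.add(new Hit(6, 8, 1f));
        hits.add(new Hit(200, 205, 3f));
        for (Fragment f : bestFragments(hits, 300, 100, 3)) {
            System.out.println(f.start + "-" + f.end + " score=" + f.score);
        }
    }
}
```

Because the work is proportional to the number of hits rather than to the length of the text, a scheme like this stays cheap on large documents, which is the scalability property the comment points out.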
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681264#action_12681264 ]

Koji Sekiguchi commented on LUCENE-1522:
----------------------------------------

{quote}
This highlighter looks very interesting! I love the colored tags, and the fast performance on large docs, and the extensive unit tests.
{quote}

Thank you for paying attention to this issue, Mike!

bq. When I applied the patch to current trunk, I see some tests failing,

Note that this issue depends on LUCENE-1448, so you need to apply LUCENE-1448.patch first, then apply LUCENE-1522.patch.

{noformat}
# To apply LUCENE-1448.patch, check out revision 713975!!!
$ svn co -r713975 http://svn.apache.org/repos/asf/lucene/java/trunk
$ cd trunk
$ patch -p0 < LUCENE-1448.patch
$ patch -p0 < LUCENE-1522.patch
{noformat}

I'll post a comment later for the rest of your questions. :)
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681282#action_12681282 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

bq. Note that this issue depends on LUCENE-1448

Woops, right I had skipped that step.
|
|
[jira] Updated: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1522:
---------------------------------------

    Fix Version/s: 2.9
|
|
[jira] Assigned: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1522:
------------------------------------------
Assignee: Michael McCandless
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681517#action_12681517 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

Does this highlighter have a "max tokens to analyze" setting? Or does it always visit all terms in each document?
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681531#action_12681531 ]

Mark Harwood commented on LUCENE-1522:
--------------------------------------

I'm guessing that's not an issue given it uses stored TermVectors rather than re-analyzing?

At some stage I hope to take a closer look at this contribution. I'd be interested to see if all the Highlighter1 Junit tests could be adapted to work with Highlighter2 and get some comparative benchmarks.
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682111#action_12682111 ]

Koji Sekiguchi commented on LUCENE-1522:
----------------------------------------

Mike, I'm sorry for the late reply.

bq. Is this approach guaranteed to only highlight term occurrences that actually contribute to the document match?

I'm not sure I understand what you are asking, but if you are talking about the "hl.requireFieldMatch feature in Solr", YES. highlighter2 has the feature:

{code:java}
/**
 * a constructor. A FragListBuilder and a FragmentsBuilder can be specified (plugins).
 *
 * @param phraseHighlight true or false for phrase highlighting
 * @param fieldMatch true or false for field matching
 * @param fragListBuilder an instance of FragListBuilder
 * @param fragmentsBuilder an instance of FragmentsBuilder
 */
public Highlighter( boolean phraseHighlight, boolean fieldMatch,
    FragListBuilder fragListBuilder, FragmentsBuilder fragmentsBuilder ){
  this.phraseHighlight = phraseHighlight;
  this.fieldMatch = fieldMatch;
  this.fragListBuilder = fragListBuilder;
  this.fragmentsBuilder = fragmentsBuilder;
}
{code}

bq. Can it handle all / arbitrary Query subclasses?

Currently, no. Highlighter2 calls the flatten() method at the beginning to try to flatten the sourceQuery. The flatten() method recognizes TermQuery and PhraseQuery, and BooleanQuery that contains TermQuery and PhraseQuery:

{code:title=FieldQuery.java}
void flatten( Query sourceQuery, Collection<Query> flatQueries ){
  if( sourceQuery instanceof BooleanQuery ){
    BooleanQuery bq = (BooleanQuery)sourceQuery;
    for( BooleanClause clause : bq.getClauses() ){
      if( !clause.isProhibited() )
        flatten( clause.getQuery(), flatQueries );
    }
  }
  else if( sourceQuery instanceof TermQuery ){
    if( !flatQueries.contains( sourceQuery ) )
      flatQueries.add( sourceQuery );
  }
  else if( sourceQuery instanceof PhraseQuery ){
    if( !flatQueries.contains( sourceQuery ) ){
      PhraseQuery pq = (PhraseQuery)sourceQuery;
      if( pq.getTerms().length > 1 )
        flatQueries.add( pq );
      else if( pq.getTerms().length == 1 ){
        flatQueries.add( new TermQuery( pq.getTerms()[0] ) );
      }
    }
  }
  // else discard queries
}
{code}

But I'm positive about supporting all / arbitrary Query subclasses in H2. :)

bq. How does it score fragments?

Currently, H2 takes into account the query-time boost and the tf within the fragment. For example, if we have q="a OR b^3" and two fragment candidates f1="a a a" and f2="a b", f1 gets 3 and f2 gets 4, so getBestFragments() will return f2 first, then f1, when ScoreOrderFragmentsBuilder (the default) is used.
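The scoring rule described in the comment above (sum of the query-time boost over every query term occurrence in the fragment) can be sketched roughly as follows. This is only an illustration under assumptions: the class FragmentScoreSketch and its methods are hypothetical and are not taken from the patch, fragments are modelled as whitespace-separated terms, and the real ScoreOrderFragmentsBuilder may weight terms or break ties differently. It is also written against a recent JDK rather than Java 1.5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not code from the LUCENE-1522 patch.
class FragmentScoreSketch {

  // Adds up the query-time boost of every query term occurrence in the fragment,
  // i.e. boost * tf summed over the query terms.
  static float score( String fragment, Map<String, Float> boosts ){
    float score = 0f;
    for( String term : fragment.trim().split( "\\s+" ) ){
      Float boost = boosts.get( term );
      if( boost != null ) score += boost;
    }
    return score;
  }

  // Returns the candidates ordered by descending score (the sort is stable for ties).
  static List<String> bestFirst( List<String> fragments, Map<String, Float> boosts ){
    List<String> sorted = new ArrayList<>( fragments );
    sorted.sort( Comparator.comparingDouble( (String f) -> -score( f, boosts ) ) );
    return sorted;
  }

  public static void main( String[] args ){
    // q="a OR b^3": term a has boost 1, term b has boost 3.
    Map<String, Float> boosts = new HashMap<>();
    boosts.put( "a", 1f );
    boosts.put( "b", 3f );
    for( String f : bestFirst( Arrays.asList( "a a a", "a b" ), boosts ) )
      System.out.println( f + " -> " + score( f, boosts ) );
  }
}
```

With the example from the comment, "a b" scores 4 and "a a a" scores 3, so "a b" is listed first.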
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682112#action_12682112 ]

Koji Sekiguchi commented on LUCENE-1522:
----------------------------------------

Mark,

bq. I'm guessing that's not an issue given it uses stored TermVectors rather than re-analyzing?

Correct.

bq. At some stage I hope to take a closer look at this contribution.

Very nice!

bq. I'd be interested to see if all the Highlighter1 Junit tests could be adapted to work with Highlighter2 and get some comparative benchmarks.

I'm not sure all H1 test cases can be adapted to work with H2, because the fragment boundaries will differ between H1 and H2, but performance benchmarks are on my to-do list.
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682270#action_12682270 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

bq. I'm not sure I understand what you are asking, but if you are talking about the "hl.requireFieldMatch feature in Solr", YES. highlighter2 has the feature:

Actually I was asking whether every fragment that's returned is guaranteed to show a match to my original query. EG if my query is a PhraseQuery, is it guaranteed that all fragments presented are valid matches? If I search for "Alan Greenspan's mortgage", is it ever possible to see a fragment that contains only "Alan Greenspan"?

bq. Currently, no. Highlighter2 calls the flatten() method at the beginning to try to flatten the sourceQuery. The flatten() method recognizes TermQuery and PhraseQuery, and BooleanQuery that contains TermQuery and PhraseQuery:

OK so eg *SpanQuery won't work? It seems like both highlighters take this "flatten" approach, which can lose the constraints for interesting queries (like Span, or a custom query).

I think a nice [eventual] model would be if we could simply re-run the scorer on the single document (using InstantiatedIndex maybe, or simply some sort of wrapper on the term vectors which are already a mini-inverted-index for a single doc), but extend the scorer API to tell us the exact term occurrences that participated in a match (which I don't think is exposed today).

EG ExactPhraseScorer.phraseFreq has the logic to check term positions and find all positions where the phrase matches. Right now that method throws away the specific position where each match occurred, but if instead we had it call a normally no-op method (recordDocMatchPosition(int position, float score) or some such), we could then make use of it for highlighting.
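The "normally no-op hook" idea from the comment above could look roughly like the sketch below. Everything in it is an assumption made for illustration: PhraseMatchSketch and MatchRecorder are hypothetical types that do not exist in Lucene, the method only borrows the names phraseFreq and recordDocMatchPosition from the comment, and the per-term position arrays stand in for what a term vector with positions would provide for a single document. It assumes the phrase has at least one term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types exist in Lucene.
class PhraseMatchSketch {

  // Hook invoked for every phrase match; plain scoring would pass NO_OP.
  interface MatchRecorder {
    void recordDocMatchPosition( int position, float score );
  }

  static final MatchRecorder NO_OP = ( position, score ) -> {};

  // positions[i] holds the sorted positions of the i-th phrase term in one document.
  // Returns the exact-phrase frequency and reports the start position of each match
  // to the recorder instead of throwing it away.
  static int phraseFreq( int[][] positions, MatchRecorder recorder ){
    int freq = 0;
    for( int start : positions[0] ){
      boolean match = true;
      // The i-th term must occur exactly i positions after the first term.
      for( int i = 1; i < positions.length && match; i++ )
        match = Arrays.binarySearch( positions[i], start + i ) >= 0;
      if( match ){
        freq++;
        recorder.recordDocMatchPosition( start, 1.0f );
      }
    }
    return freq;
  }

  public static void main( String[] args ){
    // Document "a b c a b c", phrase "a b": a occurs at 0 and 3, b at 1 and 4.
    int[][] positions = { { 0, 3 }, { 1, 4 } };
    List<Integer> starts = new ArrayList<>();
    int freq = phraseFreq( positions, ( position, score ) -> starts.add( position ) );
    System.out.println( "freq=" + freq + " starts=" + starts );
  }
}
```

A highlighter built on such a hook would pass a recorder that collects positions, while ordinary searching would pass NO_OP and pay essentially nothing for it.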
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682351#action_12682351 ]

Koji Sekiguchi commented on LUCENE-1522:
----------------------------------------

{quote}
Actually I was asking whether every fragment that's returned is guaranteed to show a match to my original query. EG if my query is a PhraseQuery, is it guaranteed that all fragments presented are valid matches? If I search for "Alan Greenspan's mortgage", is it ever possible to see a fragment that contains only "Alan Greenspan"?
{quote}

I see. Yes, H2 guarantees it.

{quote}
OK so eg *SpanQuery won't work? It seems like both highlighters take this "flatten" approach, which can lose the constraints for interesting queries (like Span, or a custom query).
{quote}

H2 doesn't support SpanQuery right now. I'll look at SpanScorer and LUCENE-1425 to see whether I can support "interesting queries" in H2, before moving on to the "eventual model" you mentioned above (which looks great).
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682387#action_12682387 ]

Mark Miller commented on LUCENE-1522:
-------------------------------------

I don't think it's easy to get a speedy highlighter that works with positions for all of the Lucene queries.

In the long term, I'd love to see a fast highlighter that works with positions for all of Lucene's queries. I'd also like it to work if you don't have termvectors stored (though perhaps be faster if they are, as it is now).

Essentially we have each of these pieces separately now; the difficulty is doing it with one highlighter.

We have the standard Highlighter with two modes: one that doesn't handle positions, and one that handles position-sensitive highlighting for Spans and almost all of the queries. This framework is great: it's customizable, it handles a lot of corner cases, it works without termvectors, and it gets faster with termvectors. Unfortunately, it runs through the source stream one token at a time, and doesn't scale well. Getting hit positions for position-sensitive clauses requires converting the query to a span query and calling getSpans on a memory index.

We also have the termvector highlighters that can work from offsets in the query and avoid running through a token at a time. You need termvectors for this approach, and it's difficult to handle positions, but it scales.

The difficulty and goal is in merging the qualities of both.
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682390#action_12682390 ]

Mark Miller commented on LUCENE-1522:
-------------------------------------

{quote}I think a nice [eventual] model would be if we could simply re-run the scorer on the single document (using InstantiatedIndex maybe, or simply some sort of wrapper on the term vectors which are already a mini-inverted-index for a single doc), but extend the scorer API to tell us the exact term occurrences that participated in a match (which I don't think is exposed today).{quote}

Variations on this have been tossed around before, but this sounds like a slightly more interesting approach than what's been mentioned. It's sort of how the current highlighter handles positions, but avoids the messy step of trying to convert any query to a spanquery. Not sure it solves being able to get offsets from the query terms and still mask for positions though. If that step can be completed, we can start by using the current SpanScorer logic with this patch, until we get the pieces into core Lucene.
|
|
[jira] Commented: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682422#action_12682422 ]

Michael Busch commented on LUCENE-1522:
---------------------------------------

{quote}
I think a nice [eventual] model would be if we could simply re-run the scorer on the single document (using InstantiatedIndex maybe, or simply some sort of wrapper on the term vectors which are already a mini-inverted-index for a single doc), but extend the scorer API to tell us the exact term occurrences that participated in a match (which I don't think is exposed today).
{quote}

But, if you have for example a document 'a b c a b c' and the query 'a AND b', then this approach would only highlight the first two terms, no?
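The 'a b c a b c' concern can be made concrete with a toy model. The sketch below is plain Java with no Lucene dependency; the class `FirstMatchVsAll` and its methods are hypothetical names invented for illustration. It assumes a matcher that stops at the first occurrence of each required term, which is only one possible way a re-run scorer might behave, and compares it with collecting every occurrence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of the concern above; this is not Lucene code. */
public class FirstMatchVsAll {

    /** Positions reported by a matcher that stops at the first hit for each required term. */
    public static List<Integer> firstMatchPositions(String[] doc, String... requiredTerms) {
        List<String> tokens = Arrays.asList(doc);
        List<Integer> positions = new ArrayList<>();
        for (String term : requiredTerms) {
            int first = tokens.indexOf(term);
            if (first < 0) {
                return new ArrayList<>(); // AND semantics: a missing term means no match
            }
            positions.add(first);
        }
        Collections.sort(positions);
        return positions;
    }

    /** Positions of every occurrence of every required term, if the AND query matches. */
    public static List<Integer> allPositions(String[] doc, String... requiredTerms) {
        List<String> tokens = Arrays.asList(doc);
        List<String> required = Arrays.asList(requiredTerms);
        if (!tokens.containsAll(required)) {
            return new ArrayList<>(); // AND semantics: a missing term means no match
        }
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < doc.length; i++) {
            if (required.contains(doc[i])) {
                positions.add(i);
            }
        }
        return positions;
    }

    /** Wraps the tokens at the given positions in bold tags. */
    public static String highlight(String[] doc, List<Integer> positions) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < doc.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            if (positions.contains(i)) {
                sb.append("<b>").append(doc[i]).append("</b>");
            } else {
                sb.append(doc[i]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] doc = "a b c a b c".split(" ");
        // prints: <b>a</b> <b>b</b> c a b c
        System.out.println(highlight(doc, firstMatchPositions(doc, "a", "b")));
        // prints: <b>a</b> <b>b</b> c <b>a</b> <b>b</b> c
        System.out.println(highlight(doc, allPositions(doc, "a", "b")));
    }
}
```

Under the first-hit assumption only the leading 'a' and 'b' are tagged, so a highlighter built on a re-run scorer would need the scorer to keep reporting occurrences after the document has already qualified as a match.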