[jira] Created: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

[jira] Created: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org

Regex support and beyond in JavaCC QueryParser
----------------------------------------------

                 Key: LUCENE-2039
                 URL: https://issues.apache.org/jira/browse/LUCENE-2039
             Project: Lucene - Java
          Issue Type: Improvement
          Components: QueryParser
            Reporter: Simon Willnauer
            Priority: Minor
             Fix For: 3.1


Since the early days, the standard query parser has been limited to the queries living in core; adding other queries or extending the parser in any way always forced people to change the grammar file and regenerate. Even if you change the grammar, you have to be extremely careful how you modify the parser so that other parts of the standard parser are not affected by customisation changes. Eventually you had to live with all the limitations the current parser has, like tokenizing on whitespace before a tokenizer / analyzer has the chance to look at the tokens.
I was thinking about how to overcome this limitation and add regex support to the query parser without introducing any dependency to core. I added a new special character that basically prevents the parser from interpreting any of the characters it encloses. I chose the forward slash '/' as the delimiter, so that everything between two forward slashes is basically escaped and ignored by the parser. All chars embedded within forward slashes are treated as one token, even if it contains other special chars like * []?{} or whitespace. This token is subsequently passed to a pluggable "parser extension" which builds a query from the embedded string. I do not interpret the embedded string in any way but leave all the subsequent work to the parser extension. Such an extension could be another full-featured query parser itself or simply a ctor call for a regex query. The interface remains quite simple but makes the parser extensible in an easy way compared to modifying the JavaCC sources.
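
To make the idea concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not the code from the patch: the `ParserExtension` interface, the `tokenize` helper and all other names are hypothetical, and the hand-written loop only stands in for what the JavaCC grammar would do. It shows text between two forward slashes surviving as a single token and being handed, uninterpreted, to an extension.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SlashExtensionSketch {

    /** Hypothetical hook: receives the raw text found between the slashes. */
    public interface ParserExtension {
        Object parse(String field, String rawText);
    }

    /** Splits on whitespace, except inside /.../ where everything stays one token. */
    public static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inSlashes = false;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '\\' && i + 1 < query.length()) {
                // Keep escaped characters (including an escaped slash) verbatim.
                current.append(c).append(query.charAt(++i));
            } else if (c == '/') {
                inSlashes = !inSlashes;
                current.append(c);
            } else if (Character.isWhitespace(c) && !inSlashes) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** True for tokens of the form /.../ (simplified: no field prefix handling). */
    public static boolean isExtensionToken(String token) {
        return token.length() >= 2 && token.startsWith("/") && token.endsWith("/");
    }

    /** Strips the two delimiting slashes. */
    public static String body(String token) {
        return token.substring(1, token.length() - 1);
    }

    public static void main(String[] args) {
        // Assumption: the extension simply compiles the text as a JDK regex.
        ParserExtension regexExtension = (field, raw) -> Pattern.compile(raw);
        for (String token : tokenize("title:lucene /ab c*[0-9]?/ parser")) {
            if (isExtensionToken(token)) {
                Object result = regexExtension.parse("body", body(token));
                System.out.println("extension <- " + body(token) + " (" + result.getClass().getSimpleName() + ")");
            } else {
                System.out.println("grammar   <- " + token);
            }
        }
    }
}
```

The extension sees the embedded string exactly as typed, including the whitespace and the `*`, `[`, `]` and `?` characters that the grammar would otherwise interpret.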

The downside of this patch is clearly that I introduce a new special char into the syntax, but I guess that would not be much of a deal as it is reflected in the escape method. It would truly be nice to have more than one extension and have this even more flexible, so treat this patch as a kickoff.

Another way of solving the problem with RegexQuery would be to move the JDK version of regex into the core and simply have another method like:
{code}
protected Query newRegexQuery(Term t) {
  ...
}
{code}

which I would like better, as it would be more consistent with the idea of the query parser being a very strict and defined parser.

I will upload a patch in a second which implements the extension-based approach. I guess I will add a second patch with regex in core soon too.



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...


[jira] Updated: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


     [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2039:
------------------------------------

    Attachment: LUCENE-2039.patch

Attached the extension-based patch.



[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774364#action_12774364 ]

Uwe Schindler commented on LUCENE-2039:
---------------------------------------

Wrrrr brrrr grrrr gnarf



[jira] Issue Comment Edited: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774364#action_12774364 ]

Uwe Schindler edited comment on LUCENE-2039 at 11/6/09 8:12 PM:
----------------------------------------------------------------

I do not like this extension.

In my opinion, we should simply use the new QueryParser framework for it, where it is quite easy to plug in support for RegexQueries even if they live in contrib.

      was (Author: thetaphi):
    Wrrrr brrrr grrrr gnarf
 



[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774568#action_12774568 ]

Grant Ingersoll commented on LUCENE-2039:
-----------------------------------------

The new QP framework is not proven out and doesn't have very many people using it and is still in contrib.  This extension allows for a pretty simple way for people to add simple extensions to the current QP without having to do a whole lot of programming.



[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776023#action_12776023 ]

Luis Alves commented on LUCENE-2039:
------------------------------------

I agree with Uwe.

I think we should implement this on the new queryparser using the opaque terms framework described in LUCENE-1823.

The current implementation of this patch will create backward-compatibility syntax problems: queries using "/" characters, for example file paths or URLs, would be affected. If we are doing this, we should change the syntax to allow for opaque terms.
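
As an illustration of the concern, the following sketch counts the slashes a parser would see as delimiters in a path-like term, before and after escaping. It assumes a standalone `escape` helper that backslash-escapes a fixed set of special characters with '/' added to the set; the helper, the character set and the class name are illustrative and are not taken from the patch or from Lucene itself.

```java
public class SlashEscapeSketch {

    // Characters treated as special in this sketch; '/' is the addition under discussion.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Backslash-escapes every special character in a term value. */
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Counts slashes that are not preceded by an escaping backslash. */
    public static int unescapedSlashes(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '/') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String path = "/var/log/app.log";
        // Three unescaped slashes: a slash-delimited syntax would pair the first two.
        System.out.println(unescapedSlashes(path));
        // After escaping, no slash is left for the parser to treat as a delimiter.
        System.out.println(escape(path));
        System.out.println(unescapedSlashes(escape(path)));
    }
}
```

Queries that an application already escapes before parsing would keep working under this assumption; stored or hand-written queries containing raw slashes are the ones that would change meaning.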

When we have support for opaque terms in the new queryparser, we can implement regex support with it.

Opaque terms is a framework to extend the queryparser syntax by passing parts of the query to smaller parsing code (not a full parser) or to an analyzer, allowing extensions of the query syntax as needed without requiring changes to the Lucene code.



[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776043#action_12776043 ]

Grant Ingersoll commented on LUCENE-2039:
-----------------------------------------

I have a need for this in the Lucene Query Parser.  It simply isn't practical for me to switch to using the contrib Query Parser as that would involve a fair amount of changes in the application.  As for the back compat issue, I think we can work around that by having a flag set.  I'll look into it a bit more.



[jira] Assigned: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


     [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned LUCENE-2039:
---------------------------------------

    Assignee: Grant Ingersoll



[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776048#action_12776048 ]

Robert Muir commented on LUCENE-2039:
-------------------------------------

regardless of which query parser, I think it would be nice to have regex support in some query parser available.

doesn't query parser now take Version as a required argument? Maybe the back compat issue could be solved with that???
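
A rough sketch of what such version gating could look like, assuming a stand-in `CompatVersion` enum rather than Lucene's own Version class; the enum constants, the method names and the choice of 3.1 as the cut-over are assumptions for illustration only.

```java
public class VersionGateSketch {

    /** Hypothetical stand-in for a compatibility version passed to the parser. */
    enum CompatVersion {
        V_3_0, V_3_1;

        boolean onOrAfter(CompatVersion other) {
            return compareTo(other) >= 0;
        }
    }

    /** Treat '/' as a delimiter only when the caller opts into the newer syntax. */
    static boolean slashIsSpecial(CompatVersion matchVersion) {
        return matchVersion.onOrAfter(CompatVersion.V_3_1);
    }

    public static void main(String[] args) {
        // Applications constructed with the older version keep the old syntax.
        System.out.println(slashIsSpecial(CompatVersion.V_3_0));
        System.out.println(slashIsSpecial(CompatVersion.V_3_1));
    }
}
```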

> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
>                 Key: LUCENE-2039
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2039
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: QueryParser
>            Reporter: Simon Willnauer
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2039.patch
>
>
> Since the early days the standard query parser has been limited to the queries living in core; adding other queries or extending the parser in any way always forced people to change the grammar file and regenerate. Even if you change the grammar you have to be extremely careful how you modify the parser so that other parts of the standard parser are not affected by customisation changes. Eventually you had to live with all the limitations the current parser has, like tokenizing on whitespace before a tokenizer / analyzer has the chance to look at the tokens.
> I was thinking about how to overcome the limitation and add regex support to the query parser without introducing any dependency to core. I added a new special character that basically prevents the parser from interpreting any of the characters enclosed in the new special characters. I chose the forward slash '/' as the delimiter so that everything in between two forward slashes is basically escaped and ignored by the parser. All chars embedded within forward slashes are treated as one token even if it contains other special chars like * []?{} or whitespace. This token is subsequently passed to a pluggable "parser extension" which builds a query from the embedded string. I do not interpret the embedded string in any way but leave all the subsequent work to the parser extension. Such an extension could be another full featured query parser itself or simply a ctor call for regex query. The interface remains quite simple but makes the parser extensible in an easy way compared to modifying the javaCC sources.
> The downside of this patch is clearly that I introduce a new special char into the syntax, but I guess that would not be that much of a deal as it is reflected in the escape method. It would truly be nice to have more than one extension and have this even more flexible, so treat this patch as a kickoff.
> Another way of solving the problem with RegexQuery would be to move the JDK version of regex into the core and simply have another method like:
> {code}
> protected Query newRegexQuery(Term t) {
>   ...
> }
> {code}
> which I would like better as it would be more consistent with the idea of the query parser to be a very strict and defined parser.
> I will upload a patch in a second which implements the extension-based approach. I guess I will add a second patch with regex in core soon too.
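
The slash-delimited token described in the issue text can be illustrated with a small sketch. This is not the attached patch and not JavaCC output: `SlashTokenSketch` is a hypothetical hand-written scanner, queries are modelled as plain strings, and escaped slashes inside the delimiters are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical scanner: everything between two '/' becomes one opaque token. */
final class SlashTokenSketch {

    /**
     * Splits the input on whitespace, except that text enclosed in
     * forward slashes is kept as a single token and handed, uninterpreted,
     * to the extension. The String results stand in for Query objects.
     */
    static List<String> parse(String input, Function<String, String> extension) {
        List<String> clauses = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '/') {
                int end = input.indexOf('/', i + 1);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated '/' at position " + i);
                }
                // The embedded string is passed on as-is, special chars included.
                clauses.add(extension.apply(input.substring(i + 1, end)));
                i = end + 1;
            } else {
                int end = i;
                while (end < input.length()
                        && !Character.isWhitespace(input.charAt(end))
                        && input.charAt(end) != '/') {
                    end++;
                }
                clauses.add("term(" + input.substring(i, end) + ")");
                i = end;
            }
        }
        return clauses;
    }
}
```

The extension here is just a function from the raw string to a query, which mirrors the point that the parser itself never looks inside the delimiters.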

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...


[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776050#action_12776050 ]

Simon Willnauer commented on LUCENE-2039:
-----------------------------------------

I totally see your point, but on the other hand I really miss the option to extend the old-fashioned query parser. I do not see the new parser being THE lucene query parser by now. Many, many people are using the javaCC parser and will do so in the future. I possibly have another solution which preserves backwards compatibility and would support the query extension too.

The alternative idea is to utilize the fact that queries enclosed in double quotes are passed to getFieldQuery() and are not interpreted by the grammar. Extension queries could be embedded in quotes while the content needs to be escaped (that is already the case, though). To identify which extension should be used we could utilize the field name and a pattern so that users could plug in an extension mapped to some field name pattern. Something like: re_field:"^.*$" -> (re_field, RegexExtension)

That would not change anything in the parser as long as no extension is registered. No new character and no backwards compat issues.
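
A minimal sketch of the field-name mapping idea, assuming a stand-in for getFieldQuery() that returns strings instead of Query objects. `FieldMappedExtensions` and its `register` method are hypothetical names, not part of the Lucene QueryParser API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.regex.Pattern;

/** Hypothetical holder for extensions keyed by a field-name pattern. */
final class FieldMappedExtensions {

    // Field-name pattern -> extension taking (field, quoted text).
    private final Map<Pattern, BiFunction<String, String, String>> extensions =
            new LinkedHashMap<>();

    void register(String fieldNamePattern, BiFunction<String, String, String> extension) {
        extensions.put(Pattern.compile(fieldNamePattern), extension);
    }

    /**
     * Stand-in for getFieldQuery(field, queryText): quoted text reaches this
     * method without being interpreted by the grammar, so it can be routed to
     * an extension when the field name matches a registered pattern.
     */
    String getFieldQuery(String field, String quotedText) {
        for (Map.Entry<Pattern, BiFunction<String, String, String>> e : extensions.entrySet()) {
            if (e.getKey().matcher(field).matches()) {
                return e.getValue().apply(field, quotedText);
            }
        }
        // No extension registered for this field: behave like a normal phrase query.
        return "phrase(" + field + ":" + quotedText + ")";
    }
}
```

With no extension registered, every call falls through to the default branch, which is the backwards-compatibility property mentioned above.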

Thoughts?



[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776057#action_12776057 ]

Simon Willnauer commented on LUCENE-2039:
-----------------------------------------

bq. I think we can work around that by having a flag set. I'll look into it a bit more.

Grant, JavaCC only generates parsers; a flag is a semantic check. You need to do a lot more work to do those checks. The first step would be to build a tree using jjtree. Then you need to build the symbol table, and then you can traverse the tree to do your checks.

One solution would be creating a parser from two javacc files, one for < 3.0 and one for 3.0, something like Robert suggested. Then we could use the Version to choose the corresponding parser impl.

simon



[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776058#action_12776058 ]

Robert Muir commented on LUCENE-2039:
-------------------------------------

Simon, personally I would prefer the Version argument used for such things.

I know this isn't popular, but I'd actually be for having say, a 3.0 javacc grammar file that differs from the 2.9 one, with version driving it.

Yeah, it would be duplicated code, but it's mostly auto-generated code anyway, and I think it would be simple to understand what is going on.
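
The version-driven choice between two generated parsers could look roughly like the sketch below. Everything in it is an assumption for illustration: `GrammarParser`, `GrammarVersion` and `VersionedParserFactory` are hypothetical names, the two lambdas stand in for classes generated from two separate grammar files, and Lucene's own Version class is not used.

```java
/** Hypothetical stand-in for the parser class generated from one grammar file. */
interface GrammarParser {
    String parse(String query);
}

/** Hypothetical version switch, standing in for a Version argument. */
enum GrammarVersion { PRE_30, V30 }

final class VersionedParserFactory {

    /** Picks the parser implementation that matches the requested version. */
    static GrammarParser forVersion(GrammarVersion version) {
        switch (version) {
            case PRE_30:
                // Old grammar: '/' is an ordinary character.
                return q -> "term(" + q + ")";
            case V30:
                // New grammar: text enclosed in '/' is treated as a regex.
                return q -> (q.length() >= 2 && q.startsWith("/") && q.endsWith("/"))
                        ? "regex(" + q.substring(1, q.length() - 1) + ")"
                        : "term(" + q + ")";
            default:
                throw new IllegalArgumentException("unsupported version: " + version);
        }
    }
}
```

Callers that ask for the old version keep the old behaviour for the same input, which is the backwards-compatibility argument for duplicating the grammar.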



[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776123#action_12776123 ]

Luis Alves commented on LUCENE-2039:
------------------------------------

Hi Simon,

I think one problem lucene has today is that the queryparser code is very tightly integrated with the javacc code. If we continue to do that it will always be very difficult to create a standard way of making small changes to the current queryparser.

I like the implementation proposed by Simon; it is very similar to the opaque term idea, but I would prefer not to overload the field names.
{quote}
The alternative idea is to utilize the fact that queries enclosed in double quotes are passed to getFieldQuery() and are not interpreted by the grammar. Extension queries could be embedded in quotes while the content needs to be escaped. (that is already the case though. To identify which extension should be used we could utilize the field name and a pattern so that users could plug in extension mapped to some field name pattern. Something like: re_field:"^.*$" -> (re_field, RegexExtension)
{quote}

We should decouple the user extensions from the JavaCC-generated code. Just as the new queryparser framework does, the queryparser should allow the user to register these extensions at run time, and provide an interface that extensions should implement.

For example, something like this:
{code}
QueryParser  qp = QueryParserFactory.getInstance("3.0");
qp.registerOpaqueTerm("regexp", new QueryParserRegExpParser());
qp.registerOpaqueTerm("complex_phrase", new QueryParserComplexPhraseParser());
...
qp.parse(" regexp:\"/blah*/\" complex_phrase:\"(sun OR sunny) sky\" ",...);
{code}
Of course this is not possible with the lucene queryparser code today :(,
but this is the idea I think we should try to implement.

The problem with overloading the field is that we lose the field name information for the extensions, so we need another solution that would allow the fieldname to be available to the extensions.

Here is another idea, that would allow for fieldnames not to be overloaded,
and allow regular term or phrase syntax for extensions.
{code}
syntax:
extension:fieldname:"syntax"

examples:
regexp:title:"/blah[a-z]+[0-9]+/"  <- regexp extension, title index field
complex_phrase:title:"(sun OR sunny) sky" <- complex_phrase extension, title index field

regexp::"/blah[a-z]+[0-9]+/"  <- regexp extension, default field
complex_phrase::"(sun OR sunny) sky" <- complex_phrase extension, default field

title:"blah" <- regular field query

{code}
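
A rough sketch of how the extension:fieldname:"syntax" form could be recognized and dispatched to extensions registered at run time. All names here (`OpaqueTermExtension`, `ExtensionClauseSketch`, `register`, `parseClause`) are hypothetical, queries are modelled as strings, and only a single clause is handled, so this is an illustration of the syntax rather than a proposal for the real parser.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical interface that user extensions would implement. */
interface OpaqueTermExtension {
    String build(String field, String text);
}

final class ExtensionClauseSketch {

    // extension:fieldname:"text" (fieldname may be empty, meaning the default field)
    private static final Pattern EXTENSION_CLAUSE = Pattern.compile("(\\w+):(\\w*):\"(.*)\"");
    // fieldname:"text" (regular field query)
    private static final Pattern FIELD_CLAUSE = Pattern.compile("(\\w+):\"(.*)\"");

    private final Map<String, OpaqueTermExtension> extensions = new HashMap<>();
    private final String defaultField;

    ExtensionClauseSketch(String defaultField) {
        this.defaultField = defaultField;
    }

    void register(String name, OpaqueTermExtension extension) {
        extensions.put(name, extension);
    }

    String parseClause(String clause) {
        Matcher m = EXTENSION_CLAUSE.matcher(clause);
        if (m.matches()) {
            OpaqueTermExtension ext = extensions.get(m.group(1));
            if (ext == null) {
                throw new IllegalArgumentException("unknown extension: " + m.group(1));
            }
            // The field name stays available to the extension; it is not overloaded.
            String field = m.group(2).isEmpty() ? defaultField : m.group(2);
            return ext.build(field, m.group(3));
        }
        m = FIELD_CLAUSE.matcher(clause);
        if (m.matches()) {
            return "phrase(" + m.group(1) + ":" + m.group(2) + ")";
        }
        throw new IllegalArgumentException("unsupported clause: " + clause);
    }
}
```

Because the extension name and the field name occupy separate positions, a plain title:"blah" clause never collides with an extension name.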





[jira] Issue Comment Edited: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776123#action_12776123 ]

Luis Alves edited comment on LUCENE-2039 at 11/10/09 10:33 PM:
---------------------------------------------------------------

Hi Simon,

I think one problem lucene has today is that the queryparser code is very tightly integrated with the javacc code. If we continue to do that it will always be very difficult to create a standard way of making small changes to the current queryparser.

I like the implementation proposed by Simon; it is very similar to the opaque term idea, but I would prefer not to overload the field names.
{quote}
The alternative idea is to utilize the fact that queries enclosed in double quotes are passed to getFieldQuery() and are not interpreted by the grammar. Extension queries could be embedded in quotes while the content needs to be escaped. (that is already the case though. To identify which extension should be used we could utilize the field name and a pattern so that users could plug in extension mapped to some field name pattern. Something like: re_field:"^.*$" -> (re_field, RegexExtension)
{quote}

We should decouple the user extensions from the JavaCC-generated code. Just as the new queryparser framework does, the queryparser should allow the user to register these extensions at run time, and provide an interface that extensions should implement.

For example, something like this:
{code}
QueryParser  qp = QueryParserFactory.getInstance("3.0");
qp.registerOpaqueTerm("regexp", new QueryParserRegExpParser());
qp.registerOpaqueTerm("complex_phrase", new QueryParserComplexPhraseParser());
...
qp.parse(" regexp:\"/blah*/\" complex_phrase:\"(sun OR sunny) sky\" ",...);
{code}
Of course this is not possible with the lucene queryparser code today :(,
but this is the idea I think we should try to implement.

For the problem of field overload:
In your proposal we lose the field name information for the extensions, so we need another solution that would allow the fieldname to be available to the extensions.

Here is another idea, that would allow for fieldnames not to be overloaded,
and allow regular term or phrase syntax for extensions.
{code}
syntax:
extension:fieldname:"syntax"

examples:
regexp:title:"/blah[a-z]+[0-9]+/"  <- regexp extension, title index field
complex_phrase:title:"(sun OR sunny) sky" <- complex_phrase extension, title index field

regexp::"/blah[a-z]+[0-9]+/"  <- regexp extension, default field
complex_phrase::"(sun OR sunny) sky" <- complex_phrase extension, default field

title:"blah" <- regular field query

{code}






[jira] Issue Comment Edited: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776139#action_12776139 ]

Luis Alves edited comment on LUCENE-2039 at 11/10/09 11:00 PM:
---------------------------------------------------------------

{quote}
Grant, JavaCC only generates parsers, a flag is a semantic check. You need to do a lot more work to do those checks.
First step would be to build a tree using jjtree.
Then you need to build the symbol table and then you can traverse the tree to do your checks.
{quote}

In the new queryparser we don't use jjtree, but the same concept is implemented there:
the output from the SyntaxParser interface is a syntax tree, and this tree, just like a jjtree tree, is not tied to any lucene objects.
But I think this is an ugly solution.

I think if we use the new queryparser, it allows for multiple SyntaxParsers to use the same Processors and the Builders.
And with a small implementation of a SyntaxParser(javacc, jflex, antlr, java tokenizer, etc), you can use the same Processors and Builders to create a lucene query.
This will avoid duplicate code and allow for multiple syntaxes.
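
The reuse argument can be sketched as follows, assuming heavily simplified stand-ins for the three stages. `QueryNodeSketch`, `SyntaxParserSketch` and `PipelineSketch` are hypothetical names and are not the interfaces of the new queryparser framework; the two toy syntaxes differ only in their separator character.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Hypothetical syntax-tree node, independent of any Lucene object. */
final class QueryNodeSketch {
    final String field;
    final String text;

    QueryNodeSketch(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

/** Hypothetical stand-in for a syntax parser: text in, syntax tree out. */
interface SyntaxParserSketch {
    QueryNodeSketch parse(String query);
}

final class PipelineSketch {

    /** A shared processor: lowercases the term text. */
    static final UnaryOperator<QueryNodeSketch> LOWERCASE =
            n -> new QueryNodeSketch(n.field, n.text.toLowerCase(Locale.ROOT));

    /** A shared builder: turns the final tree into a query (a String here). */
    static final Function<QueryNodeSketch, String> TERM_BUILDER =
            n -> "term(" + n.field + ":" + n.text + ")";

    /** Two toy syntaxes that differ only in the field separator. */
    static SyntaxParserSketch splitOn(char separator) {
        return q -> {
            int i = q.indexOf(separator);
            if (i < 0) {
                throw new IllegalArgumentException("missing separator in: " + q);
            }
            return new QueryNodeSketch(q.substring(0, i), q.substring(i + 1));
        };
    }

    /** Syntax parser, then processors, then builder. */
    static String run(SyntaxParserSketch syntax,
                      List<UnaryOperator<QueryNodeSketch>> processors,
                      Function<QueryNodeSketch, String> builder,
                      String query) {
        QueryNodeSketch node = syntax.parse(query);
        for (UnaryOperator<QueryNodeSketch> p : processors) {
            node = p.apply(node);
        }
        return builder.apply(node);
    }
}
```

Only the first stage knows about the concrete syntax, so adding a syntax means writing one small parser while the processors and builder stay untouched.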

I don't want to be a preacher here, but some of these problems are already solved in the new queryparser framework; we just need to keep improving it by adding more syntaxes, extensions and features to it.

I know the new queryparser is not in main, but that can be fixed in 3.1; if the community thinks it is stable we should move it there.



> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
>                 Key: LUCENE-2039
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2039
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: QueryParser
>            Reporter: Simon Willnauer
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2039.patch
>
>
> Since the early days, the standard query parser has been limited to the queries living in core; adding other queries or extending the parser in any way always forced people to change the grammar file and regenerate. Even if you change the grammar, you have to be extremely careful how you modify the parser so that other parts of the standard parser are not affected by customisation changes. In the end you had to live with all the limitations the current parser has, like tokenizing on whitespace before a tokenizer / analyzer has the chance to look at the tokens.
> I was thinking about how to overcome this limitation and add regex support to the query parser without introducing any dependency to core. I added a new special character that prevents the parser from interpreting any of the characters enclosed in a pair of them. I chose the forward slash '/' as the delimiter, so that everything between two forward slashes is escaped and ignored by the parser. All chars embedded within forward slashes are treated as one token, even if it contains other special chars like * []?{} or whitespace. This token is subsequently passed to a pluggable "parser extension" which builds a query from the embedded string. I do not interpret the embedded string in any way but leave all the subsequent work to the parser extension. Such an extension could be another full-featured query parser itself or simply a ctor call for a regex query. The interface remains quite simple but makes the parser extensible in an easy way compared to modifying the JavaCC sources.
> The downside of this patch is clearly that I introduce a new special char into the syntax, but I guess that would not be much of a deal as it is reflected in the escape method. It would truly be nice to have more than one extension and make this even more flexible, so treat this patch as a kickoff.
> Another way of solving the problem with RegexQuery would be to move the JDK version of regex into the core and simply have another method like:
> {code}
> protected Query newRegexQuery(Term t) {
>   ...
> }
> {code}
> which I would like better as it would be more consistent with the idea of the query parser to be a very strict and defined parser.
> I will upload a patch in a second which implements the extension-based approach. I guess I will add a second patch with regex in core soon too.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...


[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776139#action_12776139 ]

Luis Alves commented on LUCENE-2039:
------------------------------------

{quote}
Grant, JavaCC only generates parsers, a flag is a semantic check. You need to do a lot more work to do those checks. First step would be to build a tree using jjtree. Then you need to build the symbol table and then you can traverse the tree to do your checks.
{quote}

In the new queryparser we don't use jjtree, but the same concept is implemented there:
the output from the SyntaxParser interface is a syntax tree that, just like a jjtree tree, is not related to any lucene objects.
But I think this is an ugly solution.

I think if we use the new queryparser, it allows for multiple SyntaxParsers to use the same Processors and the Builders.
And with a small implementation of a SyntaxParser (javacc, jflex, antlr, java tokenizer, etc), you can use the same Processors and Builders to create a lucene query.
This will avoid duplicate code and allow for multiple syntaxes.

I don't want to be a preacher here, but some of these problems are already solved in the new queryparser framework. We just need to keep improving it by adding more syntaxes, extensions and features to it.

I know the new queryparser is not in main, but that can be fixed in 3.1. If the community thinks it is stable, we should move it there.





[jira] Issue Comment Edited: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776139#action_12776139 ]

Luis Alves edited comment on LUCENE-2039 at 11/10/09 11:02 PM:
---------------------------------------------------------------

{quote}
Grant, JavaCC only generates parsers, a flag is a semantic check. You need to do a lot more work to do those checks.
First step would be to build a tree using jjtree.
Then you need to build the symbol table and then you can traverse the tree to do your checks.
{quote}

In the new queryparser we don't use jjtree, but the same concept is implemented there:
the output from the SyntaxParser interface is a syntax tree that, just like a jjtree tree, is not related to any lucene objects.
But I think this is an ugly solution.

I think if we use the new queryparser, it allows for multiple SyntaxParsers to use the same Processors and the Builders.
And with a small implementation of a SyntaxParser (javacc, jflex, antlr, java tokenizer, etc), you can use the same Processors and Builders to create a lucene query.
This will avoid duplicate code and allow for multiple syntaxes.

I don't want to be a preacher here, but some of these problems are already solved in the new queryparser framework. We just need to keep improving it by adding more syntaxes, extensions and features to it.

I know the new queryparser is not in main, but that can be fixed in 3.1.
If the community thinks it is stable, we should move it to main.




 



[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776502#action_12776502 ]

Yonik Seeley commented on LUCENE-2039:
--------------------------------------

bq. I think one problem lucene has today is that the queryparser code is very tightly integrated with the javacc code.

This almost seems more of an issue for core lucene developers - it's an annoyance that one needs to recompile the javacc grammar when just tweaking what one of the methods does.  Seems like this could easily be solved by just separating into two files... the javacc grammar would have a base class that left things like getFieldQuery() unimplemented, and then the standard QueryParser (in a different java file) would override and implement those methods.
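The two-file split described above could look roughly like the following. This is only a minimal sketch, not code from Lucene or from the patch: GrammarBase stands in for the JavaCC-generated class, SimpleQueryParser for the hand-written subclass, and both names, the toy field:text syntax and the string "queries" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in for the generated file: it owns the parsing loop only.
// (Hypothetical class, not the real JavaCC output.)
abstract class GrammarBase {
    // Hook left unimplemented by the grammar, in the spirit of getFieldQuery().
    protected abstract String getFieldQuery(String field, String text);

    // Toy "grammar": whitespace-separated clauses of the form field:text.
    public List<String> parse(String defaultField, String query) {
        List<String> clauses = new ArrayList<>();
        for (String clause : query.trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue; // empty input yields no clauses
            }
            int colon = clause.indexOf(':');
            String field = colon < 0 ? defaultField : clause.substring(0, colon);
            String text = colon < 0 ? clause : clause.substring(colon + 1);
            clauses.add(getFieldQuery(field, text));
        }
        return clauses;
    }
}

// Would live in a separate, hand-written file; changing it does not
// require regenerating the grammar.
class SimpleQueryParser extends GrammarBase {
    @Override
    protected String getFieldQuery(String field, String text) {
        // A string stands in for a real Query object here.
        return "TermQuery(" + field + ":" + text.toLowerCase(Locale.ROOT) + ")";
    }
}
```

With this layout, tweaking what getFieldQuery() does only touches the subclass file, which is the annoyance the split is meant to remove.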

bq. We should decouple the user extensions from the JAVACC generated code.

It already is today via subclassing QueryParser and overriding methods like getFieldQuery... that's very simple for users to understand and to leverage.

bq. Just like the new queryparser framework does, the queryparser should allow the user to register these extensions at run time, and have an interface that extensions should implement.

I don't understand the motivation for this - it's complex and harder for a user to understand.  Java's own extension mechanism (overriding) has worked perfectly fine in the past.




[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776668#action_12776668 ]

Luis Alves commented on LUCENE-2039:
------------------------------------

Hi Yonik,

{quote}
This almost seems more of an issue for core lucene developers - it's an annoyance that one needs to recompile the javacc grammar when just tweaking what one of the methods does. Seems like this could easily be solved by just separating into two files... the javacc grammar would have a base class that left things like getFieldQuery() unimplemented, and then the standard QueryParser (in a different java file) would override and implement those methods.
{quote}

This solution does not fix the problem of having multiple syntaxes share the same lucene processing code. For example, if you have one grammar in javacc and one in antlr, you can't use the lucene QueryParser to process the output of both. You will need to re-implement the QueryParser recursive logic in a different class to be able to use antlr.
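To make the point concrete, here is a minimal sketch of two syntaxes feeding one shared builder through a syntax-neutral tree. The names FieldNode, SyntaxParser, ColonSyntax, EqualsSyntax and SharedBuilder are all assumptions made up for this illustration; they are not the classes of the contrib queryparser, and the output is a plain string rather than a Lucene Query.

```java
import java.util.ArrayList;
import java.util.List;

// Syntax-neutral tree node: a field/text pair, unrelated to any Lucene class.
final class FieldNode {
    final String field;
    final String text;

    FieldNode(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

// Each syntax only has to produce nodes.
interface SyntaxParser {
    List<FieldNode> parse(String query);
}

// Syntax A: "field:text field:text"
class ColonSyntax implements SyntaxParser {
    public List<FieldNode> parse(String query) {
        List<FieldNode> nodes = new ArrayList<>();
        for (String clause : query.trim().split("\\s+")) {
            int i = clause.indexOf(':');
            if (i > 0) {
                nodes.add(new FieldNode(clause.substring(0, i), clause.substring(i + 1)));
            }
        }
        return nodes;
    }
}

// Syntax B: "field=text;field=text"
class EqualsSyntax implements SyntaxParser {
    public List<FieldNode> parse(String query) {
        List<FieldNode> nodes = new ArrayList<>();
        for (String clause : query.split(";")) {
            String c = clause.trim();
            int i = c.indexOf('=');
            if (i > 0) {
                nodes.add(new FieldNode(c.substring(0, i), c.substring(i + 1)));
            }
        }
        return nodes;
    }
}

// Shared builder: written once, used by every syntax.
class SharedBuilder {
    String build(List<FieldNode> nodes) {
        StringBuilder sb = new StringBuilder();
        for (FieldNode n : nodes) {
            if (sb.length() > 0) {
                sb.append(" AND ");
            }
            sb.append(n.field).append(':').append(n.text);
        }
        return sb.toString();
    }
}
```

Both `new SharedBuilder().build(new ColonSyntax().parse("title:lucene body:regex"))` and the same call with `new EqualsSyntax().parse("title=lucene;body=regex")` go through identical builder code, which is the kind of reuse that a parser with logic embedded in the grammar does not offer.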

{quote}
It already is today via subclassing QueryParser and overriding methods like getFieldQuery... that's very simple for users to understand and to leverage.
{quote}

True. This is simple, but is not customizable.
- You can't change the syntax.
- You can't reuse the QueryParser logic with other parsers
- If you do have to change syntax, you can't reuse QueryParser class anymore, you need to maintain your own copy of the class.

You can read LUCENE-1567 to understand the reasons for the new queryparser.
The focus of the new queryparser is extensibility and customization,
without changing lucene code, but reusing lucene logic as much as possible.

Take a look at TestSpanQueryParserSimpleSample in queryparser contrib, or at LUCENE-1938 (Precedence query parser).
They illustrate two cases that would be very difficult to do in the current QueryParser in lucene by overriding methods.

Actually, an implementation of PrecedenceQueryParser exists today in contrib/misc. It contains a separate javacc grammar and does not share any code with the main lucene QueryParser, and it illustrates the problem I described above (code duplication, impossible to reuse if the grammar is different, easily gets outdated when the core queryparser changes).

I'm not trying to say the QueryParser in main is worse than the one in contrib.

What I'm trying to describe is that the one in contrib is more modular, and if we build the modules
for the lucene users, they will be able to build smarter and more sophisticated solutions using Lucene in less time.
Users can decide what modules to use in the queryparser and build their query pipelines with less work.

Users can also use the pre-built ones like StandardQueryParser or PrecedenceQueryParser; these should be as easy to use as the old queryparser in main.





[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776884#action_12776884 ]

Adriano Crestani commented on LUCENE-2039:
------------------------------------------

This is a new feature already suggested before by Luis and Shai (maybe others too): the ability to delegate the syntax processing of a certain piece of the query string to another parser. It is a new feature for both the core QP and the contrib QP.

So, I think we should focus more on how/when a query substring will be delegated to another parser and not discuss how/when any logic will be applied to it. I think in both QPs, this part is already defined.

First, to identify this substring we would need an open and a close token. It could be double-quote, slash or whatever. The ideal solution would allow the user to specify these two tokens. Unfortunately, I think JavaCC is not flexible enough to allow defining these tokens programmatically (after parser generation by JavaCC). So we need to stick with some specific open/close token; that's one decision we need to make. Maybe we could provide a property file, where the user could specify the open/close token and regenerate Lucene QP using 'ant javacc' (which is pretty easy today). Anyway, by default, we could use any new token. I don't agree with double-quotes (as I think someone suggested), since they are already used by phrases, so slash is fine for me, as already defined in Simon's patch.
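As an illustration of the open/close token idea only, here is a small hand-written scanner that takes the two delimiters as constructor arguments and separates delegated spans from the rest of the query string. It is a sketch under assumptions: DelimitedSplitter is a made-up class, it is not how Simon's patch or the JavaCC grammar works, and it ignores escaping of the delimiters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-scan: splits a query string into plain parts and
// delimited parts that would be handed to another parser.
class DelimitedSplitter {
    private final char open;
    private final char close;

    DelimitedSplitter(char open, char close) {
        this.open = open;
        this.close = close;
    }

    // Delegated segments are prefixed with "EXT:", plain ones with "QP:".
    List<String> split(String query) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < query.length()) {
            int start = query.indexOf(open, pos);
            int end = start < 0 ? -1 : query.indexOf(close, start + 1);
            if (start < 0 || end < 0) {
                // No complete delimited span left: the rest is plain text.
                addPlain(out, query.substring(pos));
                break;
            }
            addPlain(out, query.substring(pos, start));
            // Everything between the delimiters is kept verbatim as one token.
            out.add("EXT:" + query.substring(start + 1, end));
            pos = end + 1;
        }
        return out;
    }

    private static void addPlain(List<String> out, String s) {
        String t = s.trim();
        if (!t.isEmpty()) {
            out.add("QP:" + t);
        }
    }
}
```

For example, `new DelimitedSplitter('/', '/').split("title:foo /ab[cd]* e?/ body:bar")` keeps `ab[cd]* e?` as a single delegated token, special characters and whitespace included. An unterminated delimiter is simply treated as plain text in this sketch.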

Now, about any semantic(logic) processing performed on any query substring, it will be up to the QP implementation. In the core QP, its own extension would be responsible to do this processing. In the contrib QP, the extension parser would only parse the substring and return a QueryNode, which will be later processed, after the syntax parsing is complete, by the query node processors. As I said before, this part is defined and I don't think we should discuss it on this topic.

I like Simon's patch, and I think the same approach can be applied to the contrib QP. The only part I disagree with is passing the fieldname to the extension parser. I wouldn't implement that in the contrib parser, because it assumes the syntax always has field names. Anyway, for the core QP, I see the reason why you pass the fieldname, and it's completely related to the way the core QP implements the semantic (logic) processing. So, in the future, if the core QP needs to pass new info to its extension parser, the extension parser interface would have to be changed :S...here I go again starting a new discussion about how semantic (logic) processing should be handled :P
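For readers who have not looked at the patch, an extension contract that receives the field name could be sketched as below. The interface and class names (ParserExtension, ToyRegexQuery, RegexExtension) are assumptions for illustration and are not taken from LUCENE-2039.patch; ToyRegexQuery is a stand-in built on java.util.regex, not Lucene's RegexQuery.

```java
import java.util.regex.Pattern;

// Hypothetical extension contract: the core parser hands over the current
// field name and the raw, uninterpreted text found between the delimiters.
interface ParserExtension<Q> {
    Q parse(String field, String rawText);
}

// Stand-in for a regex query object; holds a compiled JDK pattern.
final class ToyRegexQuery {
    final String field;
    final Pattern pattern;

    ToyRegexQuery(String field, String regex) {
        this.field = field;
        this.pattern = Pattern.compile(regex);
    }

    // True if the whole term text matches the pattern.
    boolean matches(String termText) {
        return pattern.matcher(termText).matches();
    }
}

// The "simply a ctor call" style of extension.
class RegexExtension implements ParserExtension<ToyRegexQuery> {
    public ToyRegexQuery parse(String field, String rawText) {
        return new ToyRegexQuery(field, rawText);
    }
}
```

The sketch also shows the concern raised above: any additional piece of information the core parser wants to hand over would mean another parameter on parse(), and therefore a change to every extension.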

