Order of definitions in source-highlight 2.10

Order of definitions in source-highlight 2.10

by gnombat

I just upgraded source-highlight to 2.10 and I am noticing some strange
behavior.

Suppose we have the file foo.lang:

symbol = "/"
comment start "//"

And the file test.foo:

// foo

The language definition is taken from the source-highlight manual,
section 7.4: "Order of definitions".  Note that the definitions are in
the wrong order, according to the manual: "The first expression will
always be matched first, and the second expression will never be
matched."  And yet:

$ source-highlight --lang-def=foo.lang -c foo.css --no-doc -i test.foo
<!-- Generator: GNU source-highlight 2.10
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
<pre><tt><span class="comment">// foo</span>
</tt></pre>

This was different with version 2.9:

$ source-highlight --lang-def=foo.lang -c foo.css --no-doc -i test.foo
<!-- Generator: GNU source-highlight 2.9
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
<pre><tt><span class="symbol">//</span><span class="normal"> foo</span>
</tt></pre>

What has changed between version 2.9 and 2.10?


_______________________________________________
Help-source-highlight mailing list
Help-source-highlight@...
http://lists.gnu.org/mailman/listinfo/help-source-highlight

Re: Order of definitions in source-highlight 2.10

by Lorenzo Bettini

gnombat@... wrote:

> What has changed between version 2.9 and 2.10?

Hi,

yes, the strategy for regular expression matching has changed.  Before,
it used to build one huge regular expression with many alternatives;
however, this made handling things such as backreferences a real
nightmare (the backreference numbers would have to be updated, and the
number of backreferences is limited to 9).  In particular, it required
splitting regular expressions, and the code was really buggy.
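
To illustrate the old approach, here is a minimal Python sketch of a single alternation built from the two rules in foo.lang. It is not source-highlight's actual code: the names RULES, BIG_REGEX and tokenize are hypothetical, the comment regex is a simplified stand-in, and Python's re module is assumed to pick alternatives left to right in the same way.

```python
import re

# Hypothetical stand-in for the 2.9 approach: one big alternation,
# with one named group per language element, in definition order.
RULES = [("symbol", r"/"), ("comment", r"//.*")]
BIG_REGEX = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in RULES))

def tokenize(line):
    """Return (element, text) pairs; unmatched text is tagged 'normal'."""
    tokens, pos = [], 0
    for m in BIG_REGEX.finditer(line):
        if m.start() > pos:
            tokens.append(("normal", line[pos:m.start()]))
        # lastgroup names the alternative that matched
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    if pos < len(line):
        tokens.append(("normal", line[pos:]))
    return tokens

print(tokenize("// foo"))
```

With "symbol" listed first, each "/" is consumed by the first alternative, so the comment alternative is never reached. Wrapping every rule in its own group also shifts the group numbers inside each rule, which is the backreference renumbering problem described above.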

So in 2.10 I completely rewrote the handling of regular expressions
(http://www.gnu.org/software/src-highlite/source-highlight.html#fn-29);
in particular, each element now has its own regular expression, and the
engine tests each expression, as explained in section 7.12:

"As hinted at the beginning of Language Definitions, source-highlight
uses the definitions in the language definition file to internally
create, on-the-fly, regular expressions that are used to highlight the
tokens of an input file. Here we provide some internal details that are
crucial to understand how to write language definition files correctly [29].

First of all, for each element definition, an highlighting rule is created
by source-highlight (even if they correspond to the same language
element); thus, each language definition file will correspond to a list
of highlighting rules. For each line of the input file, source-highlight
will try to match all these rules against the whole line (more formally,
against the part of the line that has not been highlighted yet). It will
not stop as soon as an highlighting rule matched, since there might be
another rule that matches “better”.

The strategy used by source-highlight is to select the first rule that
matches the longest part of the text with the smallest prefix (i.e., the
initial part of the line that contains no language element). (Thus, as
already noted in the previous sections, the order of language
definitions is crucial.) Then, it will continue to search for another
matching rule for the remaining part of the line."

So the case of / and // respects this rule, since // matches better than /.
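
As a rough illustration, the selection described in the quoted passage could be sketched in Python as follows. This is only one reading of the manual's sentence (smallest prefix first, then longest match, with ties going to the rule defined first); the names RULES and best_rule and the regular expressions are assumptions, not taken from source-highlight.

```python
import re

# Rules in definition order, as in foo.lang (regexes are illustrative).
RULES = [("symbol", re.compile(r"/")), ("comment", re.compile(r"//.*"))]

def best_rule(text):
    """Test every rule; prefer the smallest prefix, then the longest match.
    Ties keep the rule that was defined first."""
    best = None
    for name, rx in RULES:
        m = rx.search(text)
        if m is None:
            continue
        key = (m.start(), -len(m.group()))
        if best is None or key < best[0]:
            best = (key, name, m)
    return None if best is None else (best[1], best[2].group())

print(best_rule("// foo"))  # ('comment', '// foo')
```

Both rules match at position 0, but the comment rule covers more text, so it wins even though it is defined second.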

Of course, you're right: the example of 7.4 does not work anymore and I
have to update the documentation with a better example!  Sorry about
that, and thanks for the bug report.

Does this new strategy pose problems for your language definition?

hope to hear from you soon
cheers
        Lorenzo

--
Lorenzo Bettini, PhD in Computer Science, DI, Univ. Torino
ICQ# lbetto, 16080134     (GNU/Linux User # 158233)
HOME: http://www.lorenzobettini.it MUSIC: http://www.purplesucker.com
http://www.myspace.com/supertrouperabba
BLOGS: http://tronprog.blogspot.com  http://longlivemusic.blogspot.com
http://www.gnu.org/software/src-highlite
http://www.gnu.org/software/gengetopt
http://www.gnu.org/software/gengen http://doublecpp.sourceforge.net




Re: Order of definitions in source-highlight 2.10

by gnombat

Lorenzo Bettini wrote:
> Does this new strategy pose problems for your language definition?

I was using the old definition from function.lang, which can cause
problems (as explained in section 7.12).  Changing to the new definition
seems to fix things.  Thanks!




Re: Order of definitions in source-highlight 2.10

by Lorenzo Bettini

gnombat@... wrote:
> Lorenzo Bettini wrote:
>> Does this new strategy pose problems for your language definition?
>
> I was using the old definition from function.lang, which can cause
> problems (as explained in section 7.12).  Changing to the new definition
> seems to fix things.  Thanks!
>

OK, happy to hear that :-)

However, you're right about the documentation: it must be updated, since
that example does not hold anymore.

cheers
        Lorenzo


Re: Order of definitions in source-highlight 2.10

by Lorenzo Bettini

gnombat@... wrote:

> What has changed between version 2.9 and 2.10?

Hi there,

as I wrote in my previous email, the matching strategy changed between
2.9 and 2.10:

"The strategy used by source-highlight is to select the first rule that
matches the longest part of the text with the smallest prefix (i.e., the
initial part of the line that contains no language element). (Thus, as
already noted in the previous sections, the order of language
definitions is crucial.)"

However, when working on the documentation, I realized that this
strategy is too involved and a little confusing, not to mention that it
has a lot of overhead, since it tests ALL the rules in a state.

Then I realized that, basically, the rule that should be selected is the
one with the smallest prefix, but we can stop testing rules as soon as
we find a rule that matches and whose prefix (i.e., the part of the
string before the matched one) contains only spaces (or is empty).  I
think this is also the strategy used by standard regular expression
engines; at least, it seems to be enough for programming languages.

Thus, for instance, if I have

i = null;

if I match null as a keyword, its prefix is "i = " and I should not stop
testing other rules, since otherwise I would not test the symbol rule
(that is defined later).

While, if I have

    if (exp)

as soon as I match "if" as a keyword, since its prefix contains only
spaces, I can stop testing other rules.  This way, I don't even risk
matching "if (exp)" as a function call (note that with the previous
strategy the function rule would match better, since it matches more
characters).

I think this is the right strategy, and it makes the example in the
documentation work again as described.
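
The early-stop strategy described above can be sketched in Python like this. It is a minimal sketch, not the 2.10.1 implementation: the function name select_rule and all four regular expressions are simplified stand-ins chosen for the examples in this message.

```python
import re

# Rules in definition order; the regexes are simplified stand-ins.
RULES = [
    ("keyword", re.compile(r"\b(?:if|class|null)\b")),
    ("function", re.compile(r"[A-Za-z_]\w*\s*\(.*?\)")),
    ("symbol", re.compile(r"[=;/]")),
    ("comment", re.compile(r"//.*")),
]

def select_rule(text):
    """Pick the rule with the smallest prefix, stopping early when a
    rule matches with an empty or whitespace-only prefix."""
    best = None
    for name, rx in RULES:
        m = rx.search(text)
        if m is None:
            continue
        if best is None or m.start() < best[1].start():
            best = (name, m)
        if text[:m.start()].strip() == "":
            break  # nothing but spaces before the match: stop testing
    return None if best is None else (best[0], best[1].group())

print(select_rule("i = null;"))     # ('symbol', '=')
print(select_rule("    if (exp)"))  # ('keyword', 'if')
print(select_rule("// foo"))        # ('symbol', '/')
```

In the first call the keyword match has the prefix "i = ", so testing continues and the symbol rule wins with a smaller prefix. In the other two calls the first matching rule has a blank prefix, so later rules (function, comment) are never tested, which is why the order of definitions matters again.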

I've uploaded a temporary version that uses this strategy (and it also
performs faster as expected) here:

http://gdn.dsi.unifi.it/~bettini/source-highlight-2.10.1.tar.gz

I'd really appreciate to get some feedback, especially do you think that
this new strategy makes sense?

There's also a new test in the tests directory: test_string_stop.lang:

keyword = "if|class"

type = 'int'

comment delim "/*" "*/"

# thus this won't catch "/* */ /" as a regexp,
# since comment elem definition comes first
regexp = '/.*/.*/'

# this won't match if ( ) as a function,
# since keyword elem definition comes first
function = '([[:alpha:]]|_)[[:word:]]*[[:blank:]]*\(*[[:blank:]]*\)'

# the following order is conceptually wrong,
# since "//" won't be highlighted as a comment, but as two symbols
symbol = "/"
comment start "//"

It can be used with the input file test_string_stop.java and produces
the attached output, which is the one expected with the new strategy.

cheers
        Lorenzo


/* comment */ final /
/my/regexp/
  if ( ) {
    class;
    myfun ( );
  }
  int i;
  int ( );
// comment? or two symbols?



Re: Order of definitions in source-highlight 2.10

by gnombat

Lorenzo Bettini wrote:
> I've uploaded a temporary version that uses this strategy (and it also
> performs faster as expected) here:
>
> http://gdn.dsi.unifi.it/~bettini/source-highlight-2.10.1.tar.gz
>
> I'd really appreciate to get some feedback, especially do you think that
> this new strategy makes sense?

On a totally unrelated note, I notice that this new version has some
changes to the search for Boost in the configure script - which is good,
because I've never been able to get it to compile without passing a
bunch of options to configure :)  But shouldn't there be a reference to
BOOST_CPPFLAGS in src/Makefile.am (and src/lib/Makefile.am)?



Re: Order of definitions in source-highlight 2.10

by Lorenzo Bettini

gnombat@... wrote:

> Lorenzo Bettini wrote:
>> I've uploaded a temporary version that uses this strategy (and it also
>> performs faster as expected) here:
>>
>> http://gdn.dsi.unifi.it/~bettini/source-highlight-2.10.1.tar.gz
>>
>> I'd really appreciate to get some feedback, especially do you think
>> that this new strategy makes sense?
>
> On a totally unrelated note, I notice that this new version has some
> changes to the search for Boost in the configure script - which is good,
> because I've never been able to get it to compile without passing a

yes, I didn't mention that :-)
The autoconf macro that searches for that library has finally improved
and can now find it in a smarter way :-)
http://autoconf-archive.cryp.to/ax_boost_regex.html

> bunch of options to configure :)  But shouldn't there be a reference to
> BOOST_CPPFLAGS in src/Makefile.am (and src/lib/Makefile.am)?

yes, probably you're right, I should add that!
The documentation found here http://randspringer.de/boost/ucl-sbs.html 
did not mention that, but I think I should add BOOST_CPPFLAGS to
AM_CPPFLAGS.

What about the other changes?
Do you think they make sense?

cheers
        Lorenzo


Re: Order of definitions in source-highlight 2.10

by gnombat

Lorenzo Bettini wrote:
> What about the other changes?
> Do you think they make sense?

It seems to work well.  If my understanding is correct, the end result
should be the same as it was with versions 2.9 and earlier (although the
implementation is now quite different).




Re: Order of definitions in source-highlight 2.10

by Lorenzo Bettini

gnombat@... wrote:
> Lorenzo Bettini wrote:
>> What about the other changes?
>> Do you think they make sense?
>
> It seems to work well.  If my understanding is correct, the end result
> should be the same as it was with versions 2.9 and earlier (although the
> implementation is now quite different).
>

Actually, it might not be exactly the same (but I've also updated many
.lang files).  Previously, everything relied on one big regular
expression with many alternatives, and thus on the regular expression
engine to select the most appropriate alternative, which I don't think
works the same way as the current strategy.  The current one (a manual
implementation of the selection of the right regular expression) is
targeted at programming languages (hence the check that the prefix
contains only spaces).

cheers
        Lorenzo
