BeautifulSoup is very slow with Jython


BeautifulSoup is very slow with Jython

by gooli

Hi,

I just tried to run BeautifulSoup (3.1.0.1) with Jython (2.5.1) and I
was amazed to see how much slower it was than with CPython (2.6). Parsing a
page (http://www.fixprotocol.org/specifications/fields/5000-5999) with
CPython took just under a second (0.844 seconds to be exact). With
Jython it took 564 seconds, almost 700 times as long.

Can anyone confirm this result? It doesn't seem reasonable for
Jython to run 700 times slower than CPython. Perhaps something is
wrong with my setup.

Here's the code I used:

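import time
from BeautifulSoup import BeautifulSoup

# Read the saved page first so that only the parse itself is timed.
data = open("fix-5000-5999.html").read()
start = time.time()
soup = BeautifulSoup(data)
print time.time() - start  # elapsed parse time in seconds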

---
gooli

------------------------------------------------------------------------------
Come build with us! The BlackBerry(R) Developer Conference in SF, CA
is the only developer event you need to attend this year. Jumpstart your
developing skills, take BlackBerry mobile applications to market and stay
ahead of the curve. Join us from November 9 - 12, 2009. Register now!
http://p.sf.net/sfu/devconference
_______________________________________________
Jython-users mailing list
Jython-users@...
https://lists.sourceforge.net/lists/listinfo/jython-users

Re: BeautifulSoup is very slow with Jython

by Sébastien Boisgérault-4

Eli Golovinsky wrote:

> Can anyone confirm this result? It doesn't seem reasonable for
> Jython to run 700 times slower than CPython.

CPython is about 380 times faster on my box.

ouch ...

SB


Re: BeautifulSoup is very slow with Jython

by Sébastien Boisgérault-4

Attached below are the execution profiles for CPython and Jython.

AFAICT the BeautifulSoup code itself performs OK with Jython (a few seconds at most spent in the handle_* methods), but the HTMLParser code it calls (goahead and the parse_* methods) is painfully slow.



CPYTHON

Tue Nov  3 14:24:46 2009    results

         903568 function calls (903519 primitive calls) in 6.512 CPU seconds

   Ordered by: cumulative time
   List reduced from 137 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.512    6.512 profile:0(BeautifulSoup(data))
        1    0.000    0.000    6.512    6.512 <string>:1(<module>)
        1    0.000    0.000    6.512    6.512 BeautifulSoup.py:1164(__init__)
        1    0.000    0.000    6.512    6.512 BeautifulSoup.py:1236(_feed)
        1    0.000    0.000    6.512    6.512 BeautifulSoup.py:1495(__init__)
        1    0.000    0.000    6.484    6.484 HTMLParser.py:101(feed)
        1    0.600    0.600    6.484    6.484 HTMLParser.py:132(goahead)
    11083    0.556    0.000    3.784    0.000 HTMLParser.py:224(parse_starttag)
    11083    0.060    0.000    2.552    0.000 BeautifulSoup.py:1013(handle_starttag)
    11083    0.264    0.000    2.492    0.000 BeautifulSoup.py:1397(unknown_starttag)
     8351    0.240    0.000    1.404    0.000 HTMLParser.py:305(parse_endtag)
     8351    0.060    0.000    1.044    0.000 BeautifulSoup.py:1019(handle_endtag)
     8351    0.120    0.000    0.984    0.000 BeautifulSoup.py:1427(unknown_endtag)
     8349    0.484    0.000    0.912    0.000 BeautifulSoup.py:1351(_smartPop)
    11084    0.168    0.000    0.676    0.000 BeautifulSoup.py:500(__init__)
    12420    0.276    0.000    0.604    0.000 BeautifulSoup.py:1329(_popToTag)
    19441    0.216    0.000    0.528    0.000 BeautifulSoup.py:1306(endData)
    33250    0.244    0.000    0.368    0.000 BeautifulSoup.py:1269(isSelfClosingTag)
    66134    0.344    0.000    0.344    0.000 :0(match)
   103566    0.280    0.000    0.280    0.000 BeautifulSoup.py:554(__nonzero__)

JYTHON

Tue Nov  3 14:31:34 2009    results

         383982 function calls (383944 primitive calls) in 390.007 CPU seconds

   Ordered by: cumulative time
   List reduced from 97 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.003    0.003  390.007  390.007 profile:0(BeautifulSoup(data))
        1    0.000    0.000  390.004  390.004 <string>:0(<module>)
        1    0.000    0.000  390.004  390.004 BeautifulSoup.py:1495(__init__)
        1    0.000    0.000  390.004  390.004 BeautifulSoup.py:1164(__init__)
        1    0.372    0.372  390.003  390.003 BeautifulSoup.py:1236(_feed)
        1    0.000    0.000  389.553  389.553 HTMLParser.py:101(feed)
        1  159.714  159.714  389.552  389.552 HTMLParser.py:132(goahead)
    11083  112.086    0.010  159.921    0.014 HTMLParser.py:224(parse_starttag)
     8351   68.361    0.008   69.394    0.008 HTMLParser.py:305(parse_endtag)
    11083   45.443    0.004   45.443    0.004 HTMLParser.py:275(check_for_whole_start_tag)
    11083    0.084    0.000    2.363    0.000 BeautifulSoup.py:1013(handle_starttag)
    11083    0.536    0.000    2.278    0.000 BeautifulSoup.py:1397(unknown_starttag)
     8351    0.051    0.000    1.009    0.000 BeautifulSoup.py:1019(handle_endtag)
     8351    0.077    0.000    0.958    0.000 BeautifulSoup.py:1427(unknown_endtag)
    19441    0.438    0.000    0.761    0.000 BeautifulSoup.py:1306(endData)
    11084    0.374    0.000    0.726    0.000 BeautifulSoup.py:500(__init__)
     8349    0.498    0.000    0.630    0.000 BeautifulSoup.py:1351(_smartPop)
    38924    0.435    0.000    0.435    0.000 markupbase.py:49(updatepos)
    12420    0.251    0.000    0.334    0.000 BeautifulSoup.py:1329(_popToTag)
    17252    0.220    0.000    0.245    0.000 BeautifulSoup.py:118(setup)
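
For anyone who wants to produce listings like the two above, here is a minimal sketch. It assumes the pure-Python profile module together with pstats, sorted by cumulative time and cut to 20 rows. The exact commands behind the listings are not shown in this thread, so treat this as one plausible equivalent. The helper name profile_call is made up for illustration.

```python
import profile
import pstats

try:
    from StringIO import StringIO  # Python 2 / Jython 2.5
except ImportError:
    from io import StringIO  # Python 3


def profile_call(func, *args, **kwargs):
    """Profile func(*args, **kwargs) and return a top-20 report as text.

    Hypothetical helper: the name and signature are illustrative only.
    """
    profiler = profile.Profile()
    profiler.runcall(func, *args, **kwargs)

    out = StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Same ordering and row limit as in the listings above.
    stats.sort_stats("cumulative").print_stats(20)
    return out.getvalue()
```

With BeautifulSoup installed, the report for the original test would then come from something like profile_call(BeautifulSoup, data). Note that the profiler itself adds overhead, so absolute times will be higher than in an unprofiled run.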





  

Re: BeautifulSoup is very slow with Jython

by Jim Baker

At least the problem is in a very small part of the code (which seems to be the usual case for something so bad). The goahead, parse_starttag, and parse_endtag methods in HTMLParser all use regexes extensively, so that would be my first guess. Our regex implementation is a direct port of CPython's, but it's certainly possible that we have not applied the same subsequent performance optimizations for support of such things as lookahead.

So now we need to profile a little deeper with something like YourKit to see what's really happening.
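
Before reaching for a JVM profiler, one cheap way to test the regex hypothesis is to time regex matching on its own, outside HTMLParser, and compare the numbers under CPython and Jython. The sketch below is only that: the pattern is an illustrative tag-name regex, not one copied from HTMLParser, and the function name time_tag_matches is made up.

```python
import re
import time

# Illustrative pattern only: it resembles the kind of tag-name regex an
# HTML parser applies after every "<", but it is an assumption, not a
# copy of anything in HTMLParser.
TAG_NAME = re.compile(r"[a-zA-Z][-.a-zA-Z0-9:_]*")


def time_tag_matches(html, repeat=3):
    """Return (match_count, best_seconds) for matching TAG_NAME after each '<'."""
    best = None
    count = 0
    for _ in range(repeat):
        count = 0
        start = time.time()
        pos = html.find("<")
        while pos != -1:
            # Anchored match starting right after the "<".
            if TAG_NAME.match(html, pos + 1):
                count += 1
            pos = html.find("<", pos + 1)
        elapsed = time.time() - start
        # Keep the fastest run to reduce noise from JIT warm-up or GC.
        if best is None or elapsed < best:
            best = elapsed
    return count, best
```

Running time_tag_matches(data) on the same saved page with both interpreters would show whether plain pattern matching alone accounts for a gap of this size, or whether the cost sits somewhere else in goahead and the parse_* methods.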





--
Jim Baker
jbaker@...


Re: BeautifulSoup is very slow with Jython

by gooli

Isn't CPython's regex implementation written in C? Couldn't Jython's regex
implementation use the Java regular expression engine after (possibly)
some simple translation from Python syntax to Java syntax?
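
To make the idea concrete, here is a rough sketch of what the syntax side of such a translation could look like for one construct, named groups. It assumes a Java version whose java.util.regex supports named groups (Java 7 and later). The function name python_to_java_regex is made up, and the rewrite is deliberately naive: it does not skip escaped parentheses or character classes, and it does nothing about differences in matching behaviour.

```python
import re

# Python spells named groups (?P<name>...) and backreferences (?P=name);
# Java spells them (?<name>...) and \k<name>.
_NAMED_GROUP = re.compile(r"\(\?P<([A-Za-z_][A-Za-z0-9_]*)>")
_NAMED_BACKREF = re.compile(r"\(\?P=([A-Za-z_][A-Za-z0-9_]*)\)")


def python_to_java_regex(pattern):
    """Naively rewrite Python named-group syntax into Java's spelling.

    Hypothetical helper for illustration only. Group names containing
    underscores are passed through unchanged even though Java does not
    accept them.
    """
    pattern = _NAMED_GROUP.sub(r"(?<\1>", pattern)
    pattern = _NAMED_BACKREF.sub(r"\\k<\1>", pattern)
    return pattern
```

For example, python_to_java_regex(r"(?P<tag>\w+)=(?P=tag)") gives (?<tag>\w+)=\k<tag>. Syntax is the easy part; whether the two engines then behave identically on every input is a separate question.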

---
gooli




Re: BeautifulSoup is very slow with Jython

by Alex Grönholm-3

Eli Golovinsky wrote:
> Isn't CPython regex implementation in C? Couldn't Jython's regex
> implementation use the Java regular expression engine after (possibly)
> some simple translation from Python syntax to Java syntax?

Java's regular expressions have different semantics.
Jython's REs are notoriously slow, but IIRC that is being worked on, and I'd
expect faster REs already within the 2.5 line of releases.

> ---
> gooli
>
>
>
> 2009/11/3 Jim Baker <jbaker@...>:
>  
>> At least the problem is in a very small part of the code (which seems to be
>> the usual case for something so bad). The goahead, parse_starttag, and
>> parse_endtag methods in HTMLParser all use regexes extensively, so that
>> would be my first guess. Our regex implementation is a direct port of
>> CPython's, but it's certainly possible that we have not applied the same
>> subsequent performance optimizations for support of such things as
>> lookahead.
>>
>> So now we need to profile a little deeper with something like YourKit to see
>> what's really happening.
>>
>> 2009/11/3 Sébastien Boisgérault <Sebastien.Boisgerault@...>
>>    
>>> Sébastien Boisgérault a écrit :
>>>
>>> Eli Golovinsky a écrit :
>>>
>>>
>>> Hi,
>>>
>>> I just tried to run BeautifulSoup (3.1.0.1) with Jython (2.5.1) and I
>>> was amazed to see how much slower it was than CPython (2.6). Parsing a
>>> page (http://www.fixprotocol.org/specifications/fields/5000-5999) with
>>> CPython took just under a second (0.844 second to be exact). With
>>> Jython it took 564 seconds - almost 700 times as much.
>>>
>>> Can anyone confirm this result? It's doesn't seem reasonable for
>>> Jython to run 700 times slower than CPython.
>>>
>>>
>>> CPython is about x380 faster on my box.
>>>
>>> ouch ...
>>>
>>> SB
>>>
>>>
>>>
>>> Attached below the execution profiles with CPython and Jython.
>>>
>>> AFAICT BeautifulSoup code performs OK with Jython (a few seconds tops
>>> spent in handle_* methods), but the HTMLParser code (goahead, parse_*
>>> methods) it calls is painfully slow.
>>>
>>>
>>>
>>> CPYTHON
>>>
>>> Tue Nov  3 14:24:46 2009    results
>>>
>>>          903568 function calls (903519 primitive calls) in 6.512 CPU
>>> seconds
>>>
>>>    Ordered by: cumulative time
>>>    List reduced from 137 to 20 due to restriction <20>
>>>
>>>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>>>         1    0.000    0.000    6.512    6.512
>>> profile:0(BeautifulSoup(data))
>>>         1    0.000    0.000    6.512    6.512 <string>:1(<module>)
>>>         1    0.000    0.000    6.512    6.512
>>> BeautifulSoup.py:1164(__init__)
>>>         1    0.000    0.000    6.512    6.512 BeautifulSoup.py:1236(_feed)
>>>         1    0.000    0.000    6.512    6.512
>>> BeautifulSoup.py:1495(__init__)
>>>         1    0.000    0.000    6.484    6.484 HTMLParser.py:101(feed)
>>>         1    0.600    0.600    6.484    6.484 HTMLParser.py:132(goahead)
>>>     11083    0.556    0.000    3.784    0.000
>>> HTMLParser.py:224(parse_starttag)
>>>     11083    0.060    0.000    2.552    0.000
>>> BeautifulSoup.py:1013(handle_starttag)
>>>     11083    0.264    0.000    2.492    0.000
>>> BeautifulSoup.py:1397(unknown_starttag)
>>>      8351    0.240    0.000    1.404    0.000
>>> HTMLParser.py:305(parse_endtag)
>>>      8351    0.060    0.000    1.044    0.000
>>> BeautifulSoup.py:1019(handle_endtag)
>>>      8351    0.120    0.000    0.984    0.000
>>> BeautifulSoup.py:1427(unknown_endtag)
>>>      8349    0.484    0.000    0.912    0.000
>>> BeautifulSoup.py:1351(_smartPop)
>>>     11084    0.168    0.000    0.676    0.000
>>> BeautifulSoup.py:500(__init__)
>>>     12420    0.276    0.000    0.604    0.000
>>> BeautifulSoup.py:1329(_popToTag)
>>>     19441    0.216    0.000    0.528    0.000
>>> BeautifulSoup.py:1306(endData)
>>>     33250    0.244    0.000    0.368    0.000
>>> BeautifulSoup.py:1269(isSelfClosingTag)
>>>     66134    0.344    0.000    0.344    0.000 :0(match)
>>>    103566    0.280    0.000    0.280    0.000
>>> BeautifulSoup.py:554(__nonzero__)
>>>
>>> JYTHON
>>>
>>> Tue Nov  3 14:31:34 2009    results
>>>
>>>          383982 function calls (383944 primitive calls) in 390.007 CPU seconds
>>>
>>>    Ordered by: cumulative time
>>>    List reduced from 97 to 20 due to restriction <20>
>>>
>>>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>>>         1    0.003    0.003  390.007  390.007 profile:0(BeautifulSoup(data))
>>>         1    0.000    0.000  390.004  390.004 <string>:0(<module>)
>>>         1    0.000    0.000  390.004  390.004 BeautifulSoup.py:1495(__init__)
>>>         1    0.000    0.000  390.004  390.004 BeautifulSoup.py:1164(__init__)
>>>         1    0.372    0.372  390.003  390.003 BeautifulSoup.py:1236(_feed)
>>>         1    0.000    0.000  389.553  389.553 HTMLParser.py:101(feed)
>>>         1  159.714  159.714  389.552  389.552 HTMLParser.py:132(goahead)
>>>     11083  112.086    0.010  159.921    0.014 HTMLParser.py:224(parse_starttag)
>>>      8351   68.361    0.008   69.394    0.008 HTMLParser.py:305(parse_endtag)
>>>     11083   45.443    0.004   45.443    0.004 HTMLParser.py:275(check_for_whole_start_tag)
>>>     11083    0.084    0.000    2.363    0.000 BeautifulSoup.py:1013(handle_starttag)
>>>     11083    0.536    0.000    2.278    0.000 BeautifulSoup.py:1397(unknown_starttag)
>>>      8351    0.051    0.000    1.009    0.000 BeautifulSoup.py:1019(handle_endtag)
>>>      8351    0.077    0.000    0.958    0.000 BeautifulSoup.py:1427(unknown_endtag)
>>>     19441    0.438    0.000    0.761    0.000 BeautifulSoup.py:1306(endData)
>>>     11084    0.374    0.000    0.726    0.000 BeautifulSoup.py:500(__init__)
>>>      8349    0.498    0.000    0.630    0.000 BeautifulSoup.py:1351(_smartPop)
>>>     38924    0.435    0.000    0.435    0.000 markupbase.py:49(updatepos)
>>>     12420    0.251    0.000    0.334    0.000 BeautifulSoup.py:1329(_popToTag)
>>>     17252    0.220    0.000    0.245    0.000 BeautifulSoup.py:118(setup)
>>>
>>>
>>>
>>>
>>> Perhaps something is
>>> wrong with my setup.
>>>
>>> Here's the code I used:
>>>
>>> import time
>>> from BeautifulSoup import BeautifulSoup
>>> data = open("fix-5000-5999.html").read()
>>> start = time.time()
>>> soup = BeautifulSoup(data)
>>> print time.time() - start
>>>
>>> ---
>>> gooli
>>>
>>>
>>
>> --
>> Jim Baker
>> jbaker@...
>>
>>    
>
>  



Re: BeautifulSoup is very slow with Jython

by Jim Baker

So we looked at both java.util.regex and somewhat more seriously at the JRuby regex engine (itself a port of a C-based engine). In the latter case, Nicholas Riley saw something like 30% (as I recall it) speedup over our existing engine.

But the semantics of the last "10%" makes it difficult.

The reality is that CPython's sre is quite good. We just need to put some more resources into fine-tuning our port, as well as fixing any gaping holes in performance, as seems to be suggested by what we are seeing with HTMLParser. As for the fine-tuning, we know of several opportunities to speed things up; it's just a question of spending some time on them.
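
One way to check whether the regex engine alone accounts for the gap is to time a few patterns outside of HTMLParser. The following is only a minimal sketch: the two patterns are hypothetical stand-ins for the kind of tag-matching regexes HTMLParser relies on (they are not copied from the library), and the sample markup is synthetic rather than the page from the original report.

```python
import re
import time

# Hypothetical stand-ins for tag-matching patterns; not taken from HTMLParser.
TAG_NAME = re.compile(r'[a-zA-Z][-.a-zA-Z0-9:_]*')
END_TAG = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>')

# Synthetic, table-shaped markup (an assumption, not the real test page).
SAMPLE = '<tr><td class="x">5000</td><td>Field</td></tr>\n' * 2000


def scan(pattern, text):
    """Count the non-overlapping matches of pattern in text."""
    return sum(1 for _ in pattern.finditer(text))


def time_pattern(pattern, text, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` full scans."""
    best = None
    for _ in range(repeat):
        start = time.time()
        scan(pattern, text)
        elapsed = time.time() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


if __name__ == '__main__':
    for name, pat in (('tag name', TAG_NAME), ('end tag', END_TAG)):
        print('%s: %d matches, best of 3: %.3f s'
              % (name, scan(pat, SAMPLE), time_pattern(pat, SAMPLE)))
```

Running the same script under CPython and Jython should show whether the ratio for plain regex scanning is anywhere near the ratio seen for the whole parse.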

- Jim

2009/11/3 Alex Grönholm <alex.gronholm@...>
Eli Golovinsky wrote:
> Isn't CPython regex implementation in C? Couldn't Jython's regex
> implementation use the Java regular expression engine after (possibly)
> some simple translation from Python syntax to Java syntax?
>
>
Java's regular expressions have different semantics.
Jython's REs are notoriously slow, but iirc it's being worked on and I'd
expect faster REs in the 2.5 line of releases already.
> ---
> gooli
>
>
>
> 2009/11/3 Jim Baker <jbaker@...>:
>
>> At least the problem is in a very small part of the code (which seems to be
>> the usual case for something so bad). The goahead, parse_starttag, and
>> parse_endtag methods in HTMLParser all use regexes extensively, so that
>> would be my first guess. Our regex implementation is a direct port of
>> CPython's, but it's certainly possible that we have not applied the same
>> subsequent performance optimizations for support of such things as
>> lookahead.
>>
>> So now we need to profile a little deeper with something like YourKit to see
>> what's really happening.
>>
>> 2009/11/3 Sébastien Boisgérault <Sebastien.Boisgerault@...>
>>
>>> Sébastien Boisgérault wrote:
>>>
>>> Eli Golovinsky wrote:
>>>
>>>
>>> Hi,
>>>
>>> I just tried to run BeautifulSoup (3.1.0.1) with Jython (2.5.1) and I
>>> was amazed to see how much slower it was than CPython (2.6). Parsing a
>>> page (http://www.fixprotocol.org/specifications/fields/5000-5999) with
>>> CPython took just under a second (0.844 second to be exact). With
>>> Jython it took 564 seconds - almost 700 times as much.
>>>
>>> Can anyone confirm this result? It doesn't seem reasonable for
>>> Jython to run 700 times slower than CPython.
>>>
>>>
>>> CPython is about x380 faster on my box.
>>>
>>> ouch ...
>>>
>>> SB
>>>
>>>
>>>
>>> Attached below are the execution profiles with CPython and Jython.
>>>
>>> AFAICT BeautifulSoup code performs OK with Jython (a few seconds tops
>>> spent in handle_* methods), but the HTMLParser code (goahead, parse_*
>>> methods) it calls is painfully slow.
>>>
>>>
>>>
>>> CPYTHON
>>>
>>> Tue Nov  3 14:24:46 2009    results
>>>
>>>          903568 function calls (903519 primitive calls) in 6.512 CPU seconds
>>>
>>>    Ordered by: cumulative time
>>>    List reduced from 137 to 20 due to restriction <20>
>>>
>>>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>>>         1    0.000    0.000    6.512    6.512 profile:0(BeautifulSoup(data))
>>>         1    0.000    0.000    6.512    6.512 <string>:1(<module>)
>>>         1    0.000    0.000    6.512    6.512 BeautifulSoup.py:1164(__init__)
>>>         1    0.000    0.000    6.512    6.512 BeautifulSoup.py:1236(_feed)
>>>         1    0.000    0.000    6.512    6.512 BeautifulSoup.py:1495(__init__)
>>>         1    0.000    0.000    6.484    6.484 HTMLParser.py:101(feed)
>>>         1    0.600    0.600    6.484    6.484 HTMLParser.py:132(goahead)
>>>     11083    0.556    0.000    3.784    0.000 HTMLParser.py:224(parse_starttag)
>>>     11083    0.060    0.000    2.552    0.000 BeautifulSoup.py:1013(handle_starttag)
>>>     11083    0.264    0.000    2.492    0.000 BeautifulSoup.py:1397(unknown_starttag)
>>>      8351    0.240    0.000    1.404    0.000 HTMLParser.py:305(parse_endtag)
>>>      8351    0.060    0.000    1.044    0.000 BeautifulSoup.py:1019(handle_endtag)
>>>      8351    0.120    0.000    0.984    0.000 BeautifulSoup.py:1427(unknown_endtag)
>>>      8349    0.484    0.000    0.912    0.000 BeautifulSoup.py:1351(_smartPop)
>>>     11084    0.168    0.000    0.676    0.000 BeautifulSoup.py:500(__init__)
>>>     12420    0.276    0.000    0.604    0.000 BeautifulSoup.py:1329(_popToTag)
>>>     19441    0.216    0.000    0.528    0.000 BeautifulSoup.py:1306(endData)
>>>     33250    0.244    0.000    0.368    0.000 BeautifulSoup.py:1269(isSelfClosingTag)
>>>     66134    0.344    0.000    0.344    0.000 :0(match)
>>>    103566    0.280    0.000    0.280    0.000 BeautifulSoup.py:554(__nonzero__)
>>>
>>> JYTHON
>>>
>>> Tue Nov  3 14:31:34 2009    results
>>>
>>>          383982 function calls (383944 primitive calls) in 390.007 CPU seconds
>>>
>>>    Ordered by: cumulative time
>>>    List reduced from 97 to 20 due to restriction <20>
>>>
>>>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>>>         1    0.003    0.003  390.007  390.007 profile:0(BeautifulSoup(data))
>>>         1    0.000    0.000  390.004  390.004 <string>:0(<module>)
>>>         1    0.000    0.000  390.004  390.004 BeautifulSoup.py:1495(__init__)
>>>         1    0.000    0.000  390.004  390.004 BeautifulSoup.py:1164(__init__)
>>>         1    0.372    0.372  390.003  390.003 BeautifulSoup.py:1236(_feed)
>>>         1    0.000    0.000  389.553  389.553 HTMLParser.py:101(feed)
>>>         1  159.714  159.714  389.552  389.552 HTMLParser.py:132(goahead)
>>>     11083  112.086    0.010  159.921    0.014 HTMLParser.py:224(parse_starttag)
>>>      8351   68.361    0.008   69.394    0.008 HTMLParser.py:305(parse_endtag)
>>>     11083   45.443    0.004   45.443    0.004 HTMLParser.py:275(check_for_whole_start_tag)
>>>     11083    0.084    0.000    2.363    0.000 BeautifulSoup.py:1013(handle_starttag)
>>>     11083    0.536    0.000    2.278    0.000 BeautifulSoup.py:1397(unknown_starttag)
>>>      8351    0.051    0.000    1.009    0.000 BeautifulSoup.py:1019(handle_endtag)
>>>      8351    0.077    0.000    0.958    0.000 BeautifulSoup.py:1427(unknown_endtag)
>>>     19441    0.438    0.000    0.761    0.000 BeautifulSoup.py:1306(endData)
>>>     11084    0.374    0.000    0.726    0.000 BeautifulSoup.py:500(__init__)
>>>      8349    0.498    0.000    0.630    0.000 BeautifulSoup.py:1351(_smartPop)
>>>     38924    0.435    0.000    0.435    0.000 markupbase.py:49(updatepos)
>>>     12420    0.251    0.000    0.334    0.000 BeautifulSoup.py:1329(_popToTag)
>>>     17252    0.220    0.000    0.245    0.000 BeautifulSoup.py:118(setup)
>>>
>>>
>>>
>>>
>>> Perhaps something is
>>> wrong with my setup.
>>>
>>> Here's the code I used:
>>>
>>> import time
>>> from BeautifulSoup import BeautifulSoup
>>> data = open("fix-5000-5999.html").read()
>>> start = time.time()
>>> soup = BeautifulSoup(data)
>>> print time.time() - start
>>>
>>> ---
>>> gooli
>>>
>>>
>>>
>>
>> --
>> Jim Baker
>> jbaker@...
>>
>>
>
>





--
Jim Baker
jbaker@...


Re: BeautifulSoup is very slow with Jython

by Larry Riedel-2

What were the biggest obstacles to implementing re
as a wrapper for java.util.regex?


Larry


Re: BeautifulSoup is very slow with Jython

by Nicholas Riley-4

In article
<d03bb4010911030757m64f2e6cag8c00edf2d4b1c349@...>,
 Jim Baker <jbaker@...> wrote:

> So we looked at both java.util.regex and somewhat more seriously at the
> JRuby regex engine (itself a port of a C-based engine). In the latter case,
> Nicholas Riley saw something like 30% (as I recall it) speedup over our
> existing engine.

It was closer to 50% on pyparsing, but that's a pretty extreme case of
regex usage.  One of the major issues is that the string gets converted
into an int array and back in order to handle 4-byte Unicode characters.
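
To illustrate what that conversion amounts to, here is a small sketch. It is written for Python 3 and is not Jython's actual Java code; the helper names are made up for the example. A character outside the Basic Multilingual Plane is a single code point (one int in the array) but takes two UTF-16 code units, which is why a plain array of 16-bit chars is not enough.

```python
def to_code_points(text):
    """Convert a string into a list of integer code points."""
    return [ord(ch) for ch in text]


def from_code_points(points):
    """Rebuild the string from a list of integer code points."""
    return ''.join(chr(p) for p in points)


def utf16_units(text):
    """Number of 16-bit code units the text needs in UTF-16."""
    return len(text.encode('utf-16-le')) // 2


if __name__ == '__main__':
    text = 'a\U0001D11E'  # 'a' followed by MUSICAL SYMBOL G CLEF (U+1D11E)
    points = to_code_points(text)
    print(points)              # two code points
    print(utf16_units(text))   # three UTF-16 code units
    print(from_code_points(points) == text)
```

Doing this round trip for every match is pure overhead when the input contains no such characters.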

One version of my patch is here:

<http://freya.cs.uiuc.edu/~njriley/joni.patch>

It's pretty rough, though.  It may be possible to examine the regex at
compile time to determine whether it's executable with another engine or
needs exact sre emulation.
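
A rough sketch of what such a compile-time check could look like follows. The list of constructs treated as "needs exact sre emulation" is purely an assumption for illustration, as are the function names and the 'fast'/'sre' labels; which features really behave differently between engines would have to be established separately. The check works on the pattern text only and is deliberately conservative: a false positive only costs speed, never correctness.

```python
import re

# Hypothetical list of constructs that force the exact sre port.
_SRE_ONLY = re.compile(
    r'\(\?<?[=!]'    # lookahead / lookbehind: (?=  (?!  (?<=  (?<!
    r'|\(\?P[<=]'    # named groups and named backreferences: (?P<n>  (?P=n)
    r'|\(\?\('       # conditional groups: (?(id)yes|no)
    r'|\\[1-9]'      # numeric backreferences: \1 .. \9
)


def needs_sre_emulation(pattern):
    """Return True if the pattern text contains a construct from the list above."""
    return _SRE_ONLY.search(pattern) is not None


def choose_engine(pattern):
    """Pick 'sre' for patterns that need exact emulation, 'fast' otherwise."""
    if needs_sre_emulation(pattern):
        return 'sre'
    return 'fast'


if __name__ == '__main__':
    for pat in (r'[a-zA-Z][-.a-zA-Z0-9:_]*', r'foo(?=bar)', r'(a)\1'):
        print('%-28s -> %s' % (pat, choose_engine(pat)))
```

A real implementation would more likely inspect the parsed pattern than the raw text, since text matching can be fooled by escaped parentheses or backslashes.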
--
Nicholas Riley <njriley@...>

