parsing time

View: New views
4 Messages — Rating Filter:   Alert me  

parsing time

by luther.baker :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I am writing a simplistic parser for a small Java web framework.

I have a ServletFilter that creates a factory, loads a resource and parses it for every request.

Initially, I found that if I create an XMLInputFactory2 factory on each request - and then create an XMLStreamReader2, the first invocation of XMLStreamReader2.next() took an inordinate amount of time. Subsequent invocations of .next() were just fine for the opened document.

I found an optimization page for Woodstox implying that if managed correctly, I wouldn't have to create a new factory on every request - so I moved that to a static initialization block and left only with the creation of XMLStreamReader2 on every request.

I fired up the web application and again, the very first XMLStreamReader2.next() took a long time, but subsequent iterations over that resource were fast.

Here's the rub - 'most' of the time, subsequent HttpRequests do not get delayed on their initial XMLStreamReader2.next() invocation ... but some do! If I stop using the app for a few seconds and hit refresh, at randomish times, I can see a significant pause when the code reaches the initial invocation of XMLStreamReader2.next().

If it is helpful, in the filter, I am mapping the URL to a location in the classpath (src/main/resources) and opening the file via Class.getResourceAsStream(). I am passing the resulting InputStream directly to the XMLStreamReader2. The first .next() invocation after an application restart (using Jetty) always takes a long time ... but subsequent .next() invocations are much better (see below) ... as are most subsequent requests. Unfortunately, it seems unpredictable and sometimes, the first initial invocation of .next() on subsequent again, takes awhile.

Thoughts?

-Luther


// STATIC INITIALIZER

final static XMLInputFactory2 factory;
static
{
        // http://www.vsj.co.uk/articles/display.asp?id=643
        // http://www.xml.com/pub/a/2003/09/17/stax.html
        // http://www.cowtowncoder.com/blog/archives/2006/06/entry_2.html
        factory = (XMLInputFactory2) XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        factory.configureForSpeed();
}


// PER REQUEST PROCESSING

public void process()
{
        try
        {
            final XMLStreamReader2 reader = (XMLStreamReader2) factory.createXMLStreamReader(inputStream);
            while (reader.hasNext())
            {
                 long start = System.nanoTime();
                 int eventType = reader.next();
                 long stop = System.nanoTime();


// MINIMAL PROFILING

    [reader.next(),html: 4070941 us]               // invocation of reader.next() with an html tag
    [reader.hasNext(): 15 us]
    [reader.next(): 3 us]                                   // invocation of reader.next() with a space or something I'm not watching
    [reader.hasNext(): 0 us]
    [reader.next(),head: 20 us]                        // invocation of reader.next() with a head tag
    [reader.hasNext(): 10 us]
    [reader.next(): 2 us]
    [reader.hasNext(): 0 us]
    [reader.next(),meta: 24 us]                          // invocation of reader.next() with a meta tag

Re: parsing time

by Cowtowncoder :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Sun, Apr 26, 2009 at 8:54 AM, Luther Baker <lutherbaker@...> wrote:
> I am writing a simplistic parser for a small Java web framework.
>
> I have a ServletFilter that creates a factory, loads a resource and parses
> it for every request.

Ok.

> Initially, I found that if I create an XMLInputFactory2 factory on each
> request - and then create an XMLStreamReader2, the first invocation of
> XMLStreamReader2.next() took an inordinate amount of time. Subsequent
> invocations of .next() were just fine for the opened document.
>
> I found an optimization page for Woodstox implying that if managed
> correctly, I wouldn't have to create a new factory on every request - so I
> moved that to a static initialization block and left only with the creation
> of XMLStreamReader2 on every request.

Yes, this makes sense. However, I suspect that your issue is not this, since
overhead associated should be much less obvious.

...
> If it is helpful, in the filter, I am mapping the URL to a location in the
> classpath (src/main/resources) and opening the file via
> Class.getResourceAsStream(). I am passing the resulting InputStream directly
> to the XMLStreamReader2. The first .next() invocation after an application
> restart (using Jetty) always takes a long time ... but subsequent .next()
> invocations are much better (see below) ... as are most subsequent requests.
> Unfortunately, it seems unpredictable and sometimes, the first initial
> invocation of .next() on subsequent again, takes awhile.

Reading from a Jar can be very slow, because jars are just zip archives.
Especially if contents are compressed. What JDK code does is (I think)
that it reads zip directory, which takes a while, but keeps it around.
However: any access after that still has to deal with somewhat
serial access to contents within archive.
So reading from a jar is bound to be slow, and especially so for initial
access. If you can test the hypothesis out -- for example, by just
exploding contents to file system, read from there, see how different
timings are -- I would start there.

Hope this helps,

-+ Tatu +-

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email



Re: parsing time

by luther.baker :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Ah ... we'll see if that's the problem.

I am building this in Eclipse with the m2Eclipse plugin and using the maven JETTY plugin "run" target .... and as documented, that target doesn't actually bundle anything up. It is (theoretically) reading the physical class and resource directories that Eclipse build and copies files to, in my case, project/target/classes.

For what it's worth, I can load and read the files in a similar way with Java 1.6s native SAX parser and I do NOT experience the same delay on the initial read .... but at the very least, you suggestion gives me something else to confirm/test.

After that, I guess my next step will be to attach the Woodstox source and see specifically what the delay is related to.

Thanks Tatu.

-Luther


On Mon, Apr 27, 2009 at 12:40 PM, Tatu Saloranta <tsaloranta@...> wrote:
On Sun, Apr 26, 2009 at 8:54 AM, Luther Baker <lutherbaker@...> wrote:
> I am writing a simplistic parser for a small Java web framework.
>
> I have a ServletFilter that creates a factory, loads a resource and parses
> it for every request.

Ok.

> Initially, I found that if I create an XMLInputFactory2 factory on each
> request - and then create an XMLStreamReader2, the first invocation of
> XMLStreamReader2.next() took an inordinate amount of time. Subsequent
> invocations of .next() were just fine for the opened document.
>
> I found an optimization page for Woodstox implying that if managed
> correctly, I wouldn't have to create a new factory on every request - so I
> moved that to a static initialization block and left only with the creation
> of XMLStreamReader2 on every request.

Yes, this makes sense. However, I suspect that your issue is not this, since
overhead associated should be much less obvious.

...
> If it is helpful, in the filter, I am mapping the URL to a location in the
> classpath (src/main/resources) and opening the file via
> Class.getResourceAsStream(). I am passing the resulting InputStream directly
> to the XMLStreamReader2. The first .next() invocation after an application
> restart (using Jetty) always takes a long time ... but subsequent .next()
> invocations are much better (see below) ... as are most subsequent requests.
> Unfortunately, it seems unpredictable and sometimes, the first initial
> invocation of .next() on subsequent again, takes awhile.

Reading from a Jar can be very slow, because jars are just zip archives.
Especially if contents are compressed. What JDK code does is (I think)
that it reads zip directory, which takes a while, but keeps it around.
However: any access after that still has to deal with somewhat
serial access to contents within archive.
So reading from a jar is bound to be slow, and especially so for initial
access. If you can test the hypothesis out -- for example, by just
exploding contents to file system, read from there, see how different
timings are -- I would start there.

Hope this helps,

-+ Tatu +-

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

   http://xircles.codehaus.org/manage_email




Re: parsing time

by Cowtowncoder :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Mon, Apr 27, 2009 at 10:54 AM, Luther Baker <lutherbaker@...> wrote:
> Ah ... we'll see if that's the problem.
>
> I am building this in Eclipse with the m2Eclipse plugin and using the maven
> JETTY plugin "run" target .... and as documented, that target doesn't
> actually bundle anything up. It is (theoretically) reading the physical
> class and resource directories that Eclipse build and copies files to, in my
> case, project/target/classes.

Ok.

> For what it's worth, I can load and read the files in a similar way with
> Java 1.6s native SAX parser and I do NOT experience the same delay on the
> initial read .... but at the very least, you suggestion gives me something
> else to confirm/test.
>
> After that, I guess my next step will be to attach the Woodstox source and
> see specifically what the delay is related to.

Yes, that would be helpful. I can not think of anything obvious where
there should be huge start-up cost within parsing. Although there are
costs for not reusing factory, it should mostly just mean that no
buffers are reused, nor symbols, which has incremental effect, but
nothing like order of magnitude.

And any JIT/HotSpot optimization effects should be similar for
in-built parser too. Although... theoretically it could be that
in-built parser is used for something else earlier, to get it "warm
up". But this doesn't seem very likely to explain slowness.

-+ Tatu +-

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email