Nice, thanks for sharing! Did you observe the same speed increase after running this several times, to rule out any cold/hot cache side effects?
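
A minimal sketch of the kind of repeated run I have in mind. Note that `time_runs` is a hypothetical helper written for illustration, not something shipped with Tika, and it only has one-second resolution, which should still be enough for runs in the 10 to 30 second range:

```shell
# Hypothetical helper: run a command N times and print the
# wall-clock duration of each run in whole seconds.
# Usage: time_runs N command [args...]
time_runs() {
  runs=$1
  shift
  i=1
  while [ "$i" -le "$runs" ]; do
    start=$(date +%s)        # seconds since the epoch before the run
    "$@" > /dev/null         # discard extracted text; only timing matters here
    end=$(date +%s)          # seconds since the epoch after the run
    echo "run $i: $((end - start))s"
    i=$((i + 1))
  done
}
```

For example, `time_runs 5 java -jar tika-app-0.4-SNAPSHOT.jar --text lucene-2.0.0-src.zip` would print five timings. The first run is the one most likely to pay cold file-cache costs, so the later runs are probably the fairer ones to compare between 0.3 and 0.4.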
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
> From: Jukka Zitting <jukka.zitting@...>
> To: tika-dev@...
> Sent: Wednesday, June 3, 2009 6:18:02 AM
> Subject: Major speed improvements in package parsing
>
> Hi,
>
> Inspired by TIKA-236, I ran the following ad-hoc test:
>
> $ time java -jar tika-0.3-standalone.jar --text lucene-2.0.0-src.zip > output-0.3.txt
> real 0m29.844s
> user 0m39.686s
> sys 0m0.840s
> $ time java -jar tika-app-0.4-SNAPSHOT.jar --text lucene-2.0.0-src.zip > output-0.4.txt
> real 0m12.587s
> user 0m15.911s
> sys 0m0.495s
>
> This is especially impressive as the 0.4 version is able to extract
> almost twice as much text from the archive:
>
> $ du -h output-*
> 6.8M output-0.3.txt
> 13M output-0.4.txt
>
> This speed increase is mostly the result of the TIKA-204 and TIKA-238
> improvements.
>
> Looking deeper at the output reveals some minor issues that I'll be
> filing bugs for. However, in general the result of the extraction
> seems pretty good.
>
> BR,
>
> Jukka Zitting