Major speed improvements in package parsing

by Jukka Zitting

Hi,

Inspired by TIKA-236, I ran the following ad-hoc test:

$ time java -jar tika-0.3-standalone.jar --text lucene-2.0.0-src.zip > output-0.3.txt
real 0m29.844s
user 0m39.686s
sys 0m0.840s

$ time java -jar tika-app-0.4-SNAPSHOT.jar --text lucene-2.0.0-src.zip > output-0.4.txt
real 0m12.587s
user 0m15.911s
sys 0m0.495s

This is especially impressive as the 0.4 version is able to extract
almost twice as much text from the archive:

$ du -h output-*
6.8M output-0.3.txt
13M output-0.4.txt

This speed increase is mostly the result of the TIKA-204 and TIKA-238
improvements.
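
As a rough cross-check of the figures above, the following sketch computes the wall-clock speedup and the ratio of extracted text. The variable names are illustrative only, and the sizes are the rounded values reported by du -h, so the ratios are approximate. This is a single ad-hoc run, not a controlled benchmark.

```python
# Figures copied from the `time` and `du -h` output above.
real_03, real_04 = 29.844, 12.587  # wall-clock seconds for 0.3 and 0.4-SNAPSHOT
size_03, size_04 = 6.8, 13.0       # extracted text size, as rounded by du -h

# How many times faster the 0.4-SNAPSHOT run finished.
speedup = real_03 / real_04

# How much more text the 0.4-SNAPSHOT run extracted.
text_ratio = size_04 / size_03

# Extracted text per second of wall-clock time, 0.4-SNAPSHOT relative to 0.3.
throughput_gain = (size_04 / real_04) / (size_03 / real_03)

print(f"wall-clock speedup:   {speedup:.2f}x")
print(f"extracted text ratio: {text_ratio:.2f}x")
print(f"throughput gain:      {throughput_gain:.2f}x")
```

Under these assumptions, the run finishes a bit over twice as fast while extracting almost twice as much text, so the amount of text extracted per second improves by roughly the product of the two ratios.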

Looking deeper at the output reveals some minor issues that I'll be
filing bugs for. However, in general the result of the extraction
seems pretty good.

BR,

Jukka Zitting
