[jira] Created: (TIKA-320) Allow disabling language detection in AutoDetectParser


[jira] Created: (TIKA-320) Allow disabling language detection in AutoDetectParser

by JIRA jira@apache.org

Allow disabling language detection in AutoDetectParser
------------------------------------------------------

                 Key: TIKA-320
                 URL: https://issues.apache.org/jira/browse/TIKA-320
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 0.5
            Reporter: Erik Hetzner


It should be possible to disable language detection in the AutoDetectParser.

Between 0.4 and the current trunk, the time Tika spends parsing my test data (100 MB of compressed web crawl data: mixed HTML, images, etc.) increased considerably. After profiling, I determined that most of the time was spent in language detection.

Output of time(1) when indexing my test data with Lucene using AutoDetectParser:

real 15m21.020s
user 6m31.344s
sys 0m4.556s

Output of time(1) on the same test data, using the same code as AutoDetectParser but with language detection disabled:

real 4m48.856s
user 2m9.416s
sys 0m3.484s

Obviously the exact numbers are specific to my setup, but I think they demonstrate that one ought to be able to turn off language detection: here it roughly tripled both wall-clock and user time.
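
For anyone who wants to check the ratio behind the figures above, here is a minimal sketch that converts the time(1) values to seconds and divides them. The class and method names (TimingRatio, parseSeconds, slowdown) are made up for this illustration and are not part of Tika or Lucene; the only inputs are the four durations quoted in this report.

```java
import java.util.Locale;

/** Illustrative helper only; not part of Tika or Lucene. */
final class TimingRatio {

    /** Converts a time(1)-style duration such as "15m21.020s" to seconds. */
    static double parseSeconds(String value) {
        int m = value.indexOf('m');
        double minutes = Double.parseDouble(value.substring(0, m));
        // Drop the trailing 's' before parsing the seconds part.
        double seconds = Double.parseDouble(value.substring(m + 1, value.length() - 1));
        return minutes * 60.0 + seconds;
    }

    /** Ratio of the run with language detection to the run without it. */
    static double slowdown(String withDetection, String withoutDetection) {
        return parseSeconds(withDetection) / parseSeconds(withoutDetection);
    }

    public static void main(String[] args) {
        // Durations copied from the two time(1) runs quoted in this report.
        System.out.printf(Locale.ROOT, "real: %.2fx%n", slowdown("15m21.020s", "4m48.856s"));
        System.out.printf(Locale.ROOT, "user: %.2fx%n", slowdown("6m31.344s", "2m9.416s"));
    }
}
```

Both ratios come out at a little over 3, which is the basis for calling the slowdown massive. A single run on one data set says nothing about variance, so this is a rough indication only.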



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (TIKA-320) Allow disabling language detection in AutoDetectParser

by JIRA jira@apache.org


     [ https://issues.apache.org/jira/browse/TIKA-320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-320.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Good point; I didn't consider the performance impact when adding language detection to AutoDetectParser.

I've removed the feature from AutoDetectParser in revision 835720. Clients can still add language detection on top of the Parser API if they want, and it's probably best if we don't make it an integral part of the AutoDetectParser before the feature becomes more mature.
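
As a rough illustration of what "on top of the Parser API" can look like, the sketch below wraps a parser in a decorator so that only callers who opt in pay for detection. It deliberately does not use Tika's classes: TextParser, LanguageGuesser and LanguageDetectingParser are hypothetical stand-ins invented for this example, and the toy detector is a placeholder, not a real language model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a parser API; NOT Tika's actual Parser interface. */
interface TextParser {
    String parse(String document, Map<String, String> metadata);
}

/** Hypothetical stand-in for a language detector. */
interface LanguageGuesser {
    String guess(String text);
}

/** Decorator: detection runs only when a client chooses to wrap its parser. */
final class LanguageDetectingParser implements TextParser {
    private final TextParser delegate;
    private final LanguageGuesser guesser;

    LanguageDetectingParser(TextParser delegate, LanguageGuesser guesser) {
        this.delegate = delegate;
        this.guesser = guesser;
    }

    @Override
    public String parse(String document, Map<String, String> metadata) {
        // Parse first, then run the (potentially expensive) detection step.
        String text = delegate.parse(document, metadata);
        metadata.put("language", guesser.guess(text));
        return text;
    }
}

final class LayeringDemo {
    public static void main(String[] args) {
        // Toy parser and detector, for illustration only.
        TextParser plain = (doc, meta) -> doc.trim();
        LanguageGuesser toy = text -> text.contains(" the ") ? "en" : "unknown";

        Map<String, String> fast = new LinkedHashMap<>();
        plain.parse("  over the hill  ", fast);

        Map<String, String> detected = new LinkedHashMap<>();
        new LanguageDetectingParser(plain, toy).parse("  over the hill  ", detected);

        System.out.println("plain metadata:   " + fast);
        System.out.println("wrapped metadata: " + detected);
    }
}
```

The point of the design is that the default path stays cheap: the plain parser never touches the detector, and the cost is added only where a client wraps it.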



[jira] Commented: (TIKA-320) Allow disabling language detection in AutoDetectParser

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/TIKA-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778483#action_12778483 ]

Erik Hetzner commented on TIKA-320:
-----------------------------------

Wonderful, thanks!
