[jira] Created: (TIKA-322) Improve encoding detection speed and accuracy

by JIRA jira@apache.org

Improve encoding detection speed and accuracy
---------------------------------------------

                 Key: TIKA-322
                 URL: https://issues.apache.org/jira/browse/TIKA-322
             Project: Tika
          Issue Type: Improvement
          Components: mime
            Reporter: Jukka Zitting
            Priority: Minor


The encoding detection code we took from ICU4J is not very efficient and sometimes produces odd results when more than one encoding matches the given input data. It would be good to refactor the code to be faster for easy-to-detect encodings and to have better heuristics in case multiple matches are found.
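
As a rough illustration of what a fast path for easy-to-detect encodings might look like, here is a minimal sketch. It is not the ICU4J or Tika code; the class name `EncodingFastPath` and its `detect` method are hypothetical, and it uses only the JDK. It checks byte order marks, then plain ASCII, then strict UTF-8 validity, and returns an empty result whenever the cheap checks are inconclusive so that the full statistical detector can take over. Treating valid non-ASCII UTF-8 as conclusive is a heuristic assumption, not a guarantee.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Tika or ICU4J. */
public class EncodingFastPath {

    /** Returns a charset when a cheap check is conclusive, otherwise empty. */
    public static Optional<Charset> detect(byte[] data) {
        if (data.length == 0) {
            return Optional.empty();
        }
        // FF FE 00 00 is the UTF-32LE byte order mark; leave it to the full detector.
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) {
            return Optional.empty();
        }
        // Byte order marks are unambiguous and cost only a few comparisons.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }

        boolean sawHighBit = false;
        boolean sawSuspicious = false;
        for (byte b : data) {
            int v = b & 0xFF;
            if (v >= 0x80) {
                sawHighBit = true;
            } else if (v == 0x00 || v == 0x1B) {
                // NUL hints at BOM-less UTF-16/32, ESC at ISO-2022 variants.
                sawSuspicious = true;
            }
        }
        if (sawSuspicious) {
            return Optional.empty();
        }
        if (!sawHighBit) {
            return Optional.of(StandardCharsets.US_ASCII);
        }
        // Non-ASCII bytes that form strictly valid UTF-8 are rarely anything else.
        // Note: a buffer cut off in the middle of a multi-byte sequence fails this check.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return Optional.of(StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would try `EncodingFastPath.detect(bytes)` first and fall back to the existing detector only when the result is empty, which keeps the expensive statistical matching off the common UTF-8 and ASCII paths.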

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-322) Improve encoding detection speed and accuracy

by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778137#action_12778137 ]

Luke Nezda commented on TIKA-322:
---------------------------------

http://code.google.com/p/juniversalchardet/ has a pretty good, efficient charset detector, which is a Java port of the Mozilla universalchardet algorithms. It is licensed under the Mozilla Public License Version 1.1. I am not sure whether the MPL is ASF compatible; it appears to be, but IANAL. AFAIK, it does not provide the detection confidence or language detection features that ICU4J does, and I think it has code/data files for fewer encodings, but it is primarily statistical, so they could be added. I am also not sure what choices were made with regard to multiple encodings. In theory, it should detect what Firefox detects for a given URL/file.
