[ https://issues.apache.org/jira/browse/TIKA-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778137#action_12778137 ]
Luke Nezda commented on TIKA-322:
---------------------------------
http://code.google.com/p/juniversalchardet/ has a pretty good, efficient charset detector which is a Java port of the Mozilla universalchardet algorithms. It is licensed under the Mozilla Public License Version 1.1. I am not sure whether the MPL is ASF compatible; it appears to be, but I am not a lawyer. As far as I know, it does not provide the detection-confidence or language-detection features that ICU4J does, and I think it has code/data files for fewer encodings, but since it is primarily statistical, more could be added. I am also not sure what choices it makes when multiple encodings match the input. In theory, it should detect what Firefox detects for a given URL/file.
> Improve encoding detection speed and accuracy
> ---------------------------------------------
>
> Key: TIKA-322
> URL: https://issues.apache.org/jira/browse/TIKA-322
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Reporter: Jukka Zitting
> Priority: Minor
>
> The encoding detection code we took from ICU4J is not very efficient and sometimes produces odd results when more than one encoding matches the given input data. It would be good to refactor the code to be faster for easy-to-detect encodings and to have better heuristics in case multiple matches are found.
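One way to read "faster for easy-to-detect encodings" is a cheap fast path that runs before any statistical detector. The following is only a rough sketch of that idea, not Tika or ICU4J code: the class name FastPathCharsetSniffer and the method sniff are hypothetical, and the particular checks (byte order mark, 7-bit input, strict UTF-8 validation) are assumptions about what counts as "easy to detect". An empty result means "inconclusive, hand the bytes to the full detector".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical fast path; a full statistical detector would run when this returns empty. */
public class FastPathCharsetSniffer {

    /** Returns a charset when a cheap check is conclusive, otherwise Optional.empty(). */
    public static Optional<Charset> sniff(byte[] data) {
        if (data.length == 0) {
            return Optional.empty();
        }
        // UTF-32 BOMs overlap with the UTF-16 ones; this sketch defers them to the full detector.
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00) || startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) {
            return Optional.empty();
        }
        // 1. Remaining byte order marks are unambiguous.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return Optional.of(StandardCharsets.UTF_8);
        if (startsWith(data, 0xFE, 0xFF)) return Optional.of(StandardCharsets.UTF_16BE);
        if (startsWith(data, 0xFF, 0xFE)) return Optional.of(StandardCharsets.UTF_16LE);

        // 2. Pure 7-bit input is reported as US-ASCII, unless it contains ESC,
        //    which may indicate an escape-based encoding such as ISO-2022-JP.
        boolean sevenBit = true;
        boolean hasEscape = false;
        for (byte b : data) {
            if (b < 0) sevenBit = false;
            if (b == 0x1B) hasEscape = true;
        }
        if (hasEscape) return Optional.empty();
        if (sevenBit) return Optional.of(StandardCharsets.US_ASCII);

        // 3. Strict UTF-8 validation: random 8-bit text rarely forms valid UTF-8.
        //    Note that a buffer cut off in the middle of a multi-byte sequence
        //    is also rejected here and falls through to the full detector.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return Optional.of(StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    /** True when data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

With a split like this, the statistical code only sees input that the cheap checks could not classify, which is also where better heuristics for multiple matches would matter.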
--
This message is automatically generated by JIRA.