[jira] Created: (TIKA-209) Language detection is weak.

Language detection is weak.
---------------------------

Key: TIKA-209
URL: https://issues.apache.org/jira/browse/TIKA-209
Project: Tika
Issue Type: Bug
Affects Versions: 0.3
Reporter: Robert Newson

in org.apache.tika.utils.Utils the getUTF8Reader method assigns a language determination without checking the confidence rating from ICU's CharsetDetector.

Please add a configurable level (0-100);

    if (language != null && match.getConfidence() > THRESHOLD) {
        metadata.set(Metadata.CONTENT_LANGUAGE, match.getLanguage());
        metadata.set(Metadata.LANGUAGE, match.getLanguage());
    }

Obviously using charset to imply language is generally weak but it would be sufficient if the confidence threshold was controlled. Today, the text "hello" is tagged as French, for example.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

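The gate requested in the report can be sketched in isolation. This is only an illustration: `ThresholdGate`, its nested `Match` class, and `languageIfConfident` are hypothetical names, and `Match` is a stand-in that mirrors just the two accessors quoted above (`getLanguage()` and `getConfidence()`), not ICU4J's own class. The sketch returns an `Optional` instead of writing to Tika's `Metadata`, so it runs without either library.

```java
import java.util.Optional;

final class ThresholdGate {
    private ThresholdGate() {}

    /** Hypothetical stand-in for a detector result; not ICU4J's CharsetMatch. */
    static final class Match {
        private final String language;
        private final int confidence;

        Match(String language, int confidence) {
            this.language = language;
            this.confidence = confidence;
        }

        String getLanguage() { return language; }

        /** Confidence on the 0-100 scale mentioned in the report. */
        int getConfidence() { return confidence; }
    }

    /**
     * Returns the language only when one was reported and the confidence
     * is strictly above the configured threshold (same test as in the report).
     */
    static Optional<String> languageIfConfident(Match match, int threshold) {
        if (threshold < 0 || threshold > 100) {
            throw new IllegalArgumentException("threshold must be in 0..100: " + threshold);
        }
        String language = match.getLanguage();
        if (language != null && match.getConfidence() > threshold) {
            return Optional.of(language);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A weak guess is dropped, a strong one is kept.
        System.out.println(languageIfConfident(new Match("fr", 10), 50));
        System.out.println(languageIfConfident(new Match("en", 90), 50));
    }
}
```

A caller would set the metadata fields only when the returned `Optional` is present, which keeps the threshold policy in one place.
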
[jira] Commented: (TIKA-209) Language detection is weak.

[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688660#action_12688660 ]

Robert Newson commented on TIKA-209:
------------------------------------

FYI: In my project (couchdb-lucene) I've pulled in the ngram-based LanguageIdentifier from Nutch 0.9. Since it's Apache 2 licensed, it might be something worth integrating with Tika directly?

[jira] Commented: (TIKA-209) Language detection is weak.

[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702953#action_12702953 ]

Jukka Zitting commented on TIKA-209:
------------------------------------

The getConfidence() method in CharsetMatch is for the confidence level of the character encoding detection, not of the language detection. I'm not sure if ICU4J has an easy way to determine the confidence level of language detection.

Robert: Do you know how the LanguageIdentifier stuff differs from the stuff in ICU4J?

[jira] Commented: (TIKA-209) Language detection is weak.

[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702962#action_12702962 ]

Robert Newson commented on TIKA-209:
------------------------------------

Yes, it analyzes the frequencies of the ngrams and compares them to the ngram profiles that it packages.

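A rough sketch of what "compare ngram frequencies to packaged profiles" can look like follows. It is not the Nutch code: the class `NgramProfiles`, the L1 distance measure, and the toy reference texts are all assumptions chosen to keep the example short, and a real identifier would load profiles trained on far larger corpora.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class NgramProfiles {
    private NgramProfiles() {}

    /** Relative frequency of every character n-gram of length n in the text. */
    static Map<String, Double> profile(String text, int n) {
        String s = text.toLowerCase(Locale.ROOT);
        Map<String, Double> freq = new HashMap<>();
        int total = Math.max(0, s.length() - n + 1);
        for (int i = 0; i < total; i++) {
            freq.merge(s.substring(i, i + n), 1.0, Double::sum);
        }
        // Turn raw counts into relative frequencies (no entries if total is 0).
        for (Map.Entry<String, Double> e : freq.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return freq;
    }

    /** Sum of absolute frequency differences over the union of both key sets. */
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        double d = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            d += Math.abs(e.getValue() - b.getOrDefault(e.getKey(), 0.0));
        }
        for (Map.Entry<String, Double> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                d += e.getValue();
            }
        }
        return d;
    }

    /** Name of the reference profile nearest to the text, or null if none are given. */
    static String closest(String text, Map<String, Map<String, Double>> profiles, int n) {
        Map<String, Double> sample = profile(text, n);
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double d = distance(sample, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy reference texts; far too small to be reliable in practice.
        Map<String, Map<String, Double>> profiles = new HashMap<>();
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog", 3));
        profiles.put("de", profile("der schnelle braune fuchs springt ueber den faulen hund", 3));
        System.out.println(closest("the dog and the fox", profiles, 3));
    }
}
```

Because the distance is available for every candidate, a caller can also refuse to answer when even the best profile is far away, which is something the charset-based approach in the report could not express.
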
[jira] Commented: (TIKA-209) Language detection is weak.

[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711326#action_12711326 ]

Jukka Zitting commented on TIKA-209:
------------------------------------

I think something like that would be interesting for Tika. Would you like to contribute a patch?

[jira] Commented: (TIKA-209) Language detection is weak.

[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711333#action_12711333 ]

Robert Newson commented on TIKA-209:
------------------------------------

Sure thing. I adapted the Nutch code for my project, it should be straightforward to do the same for Tika. It's all Apache licensed.

It'll be a few days, swamped with other stuff.

[jira] Commented: (TIKA-209) Language detection is weak.

[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731621#action_12731621 ]

Ted Dunning commented on TIKA-209:
----------------------------------

I haven't looked at the nutch code in forever, but my memory is that it didn't use the best statistics for the task.

Here is an approach that seems to be more accurate:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.1958

Sadly, I don't have a Java implementation of this handy. I can give out an ancient C implementation.

[jira] Commented: (TIKA-209) Language detection is weak.

[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732130#action_12732130 ]

Jukka Zitting commented on TIKA-209:
------------------------------------

Anything would be fine. I'm sure we can find someone to port the code to Java.

[jira] Commented: (TIKA-209) Language detection is weak.

[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733063#action_12733063 ]

Jukka Zitting commented on TIKA-209:
------------------------------------

I gave a look at the Nutch LanguageIdentifier code. It's indeed something we could use without too much effort.

[jira] Commented: (TIKA-209) Language detection is weak.

[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733092#action_12733092 ]

Chris A. Mattmann commented on TIKA-209:
----------------------------------------

Hey Guys:

Awesome -- that was the intention for Tika from the beginning -- Jerome and I originally proposed this as a downstream feature and I think that the time has come.

Thanks,
Chris

[jira] Updated: (TIKA-209) Language detection is weak.

[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-209:
-----------------------------------

    Component/s: languageidentifier
    Fix Version/s: 0.5

- set component and fix version

[jira] Resolved: (TIKA-209) Language detection is weak.

[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-209.
--------------------------------

    Resolution: Fixed

I have refactored and simplified the language identifier code to better meet the needs of Tika. Most notably I fixed the ngram length to three characters to reduce the size of the language profile files and to make the ngram classes simpler.

AutoDetectParser now automatically attempts to detect the document language and sets the Metadata.LANGUAGE property if a reasonably certain language profile match is found.

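The size argument for a fixed ngram length can be illustrated with a small count. The sketch below assumes, purely for illustration, that the alternative is a profile storing every length from 1 to 4; the thread does not say which lengths the earlier code used. `ProfileSize` and `distinctNgrams` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

final class ProfileSize {
    private ProfileSize() {}

    /** Distinct character n-grams of every length from minLen to maxLen, inclusive. */
    static Set<String> distinctNgrams(String text, int minLen, int maxLen) {
        Set<String> grams = new HashSet<>();
        for (int n = minLen; n <= maxLen; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                grams.add(text.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        String sample = "language detection is weak";
        // Mixed lengths 1..4 (an assumed range) versus trigrams only.
        System.out.println("lengths 1-4: " + distinctNgrams(sample, 1, 4).size());
        System.out.println("length 3:    " + distinctNgrams(sample, 3, 3).size());
    }
}
```

The trigram set is always a subset of the mixed-length set, so a trigram-only profile has fewer entries to store, and every key has the same width, which is one plausible reading of "simpler ngram classes".
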
[jira] Assigned: (TIKA-209) Language detection is weak.

[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-209:
--------------------------------------

    Assignee: Jukka Zitting

- jukka fixed this, so assign goes to him