[jira] Created: (TIKA-209) Language detection is weak.

by JIRA jira@apache.org

Language detection is weak.
---------------------------

                 Key: TIKA-209
                 URL: https://issues.apache.org/jira/browse/TIKA-209
             Project: Tika
          Issue Type: Bug
    Affects Versions: 0.3
            Reporter: Robert Newson


In org.apache.tika.utils.Utils, the getUTF8Reader method assigns a language without checking the confidence rating from ICU's CharsetDetector.

Please add a configurable threshold (0-100):

if (language != null && match.getConfidence() > THRESHOLD) {
  metadata.set(Metadata.CONTENT_LANGUAGE, match.getLanguage());
  metadata.set(Metadata.LANGUAGE, match.getLanguage());
}

Using the charset to imply the language is weak in general, but it would be sufficient if the confidence threshold were configurable. Today, for example, the text "hello" is tagged as French.
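
For illustration, the requested check can be sketched in isolation. This is a minimal sketch, not Tika's code: `Match` is a hypothetical stand-in for ICU's CharsetMatch that mirrors only the two getters used above, a plain `Map` stands in for Tika's Metadata, and the key strings and the threshold value are assumptions.

```java
import java.util.Map;

final class LanguageGate {
    private LanguageGate() {}

    /** Hypothetical stand-in for ICU's CharsetMatch (assumption, not the real class). */
    static final class Match {
        private final String language;
        private final int confidence;

        Match(String language, int confidence) {
            this.language = language;
            this.confidence = confidence;
        }

        String getLanguage() { return language; }

        int getConfidence() { return confidence; }
    }

    /**
     * Records the language only when the match's confidence (0-100) is
     * strictly above the threshold. The map keys are illustrative stand-ins
     * for Metadata.CONTENT_LANGUAGE and Metadata.LANGUAGE.
     */
    static void applyLanguage(Match match, int threshold, Map<String, String> metadata) {
        String language = match.getLanguage();
        if (language != null && match.getConfidence() > threshold) {
            metadata.put("Content-Language", language);
            metadata.put("language", language);
        }
    }
}
```

With a threshold of 50, a match of ("fr", 10) leaves the map untouched, while ("fr", 80) sets both keys.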


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-209) Language detection is weak.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688660#action_12688660 ]

Robert Newson commented on TIKA-209:
------------------------------------

FYI: In my project (couchdb-lucene) I've pulled in the ngram-based LanguageIdentifier from Nutch 0.9. Since it's Apache 2 licensed, might it be worth integrating into Tika directly?



[jira] Commented: (TIKA-209) Language detection is weak.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702953#action_12702953 ]

Jukka Zitting commented on TIKA-209:
------------------------------------

The getConfidence() method in CharsetMatch is for the confidence level of the character encoding detection, not of the language detection.

I'm not sure if ICU4J has an easy way to determine the confidence level of language detection.

Robert: Do you know how the LanguageIdentifier stuff differs from the stuff in ICU4J?



[jira] Commented: (TIKA-209) Language detection is weak.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702962#action_12702962 ]

Robert Newson commented on TIKA-209:
------------------------------------


Yes. It analyzes the ngram frequencies of the input text and compares them to the ngram profiles that it packages.
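
As a rough illustration of that idea (not the Nutch implementation), the sketch below builds a relative-frequency profile of character ngrams and compares two profiles. The class name `NgramProfile`, the lower-casing step, and the sum-of-absolute-differences distance are assumptions chosen for brevity.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class NgramProfile {
    // Maps each ngram to its relative frequency in the source text.
    private final Map<String, Double> freq = new HashMap<>();

    NgramProfile(String text, int n) {
        String s = text.toLowerCase(Locale.ROOT);
        int total = Math.max(0, s.length() - n + 1);
        for (int i = 0; i < total; i++) {
            // Each occurrence contributes 1/total, so the values sum to 1.
            freq.merge(s.substring(i, i + n), 1.0 / total, Double::sum);
        }
    }

    /**
     * Sum of absolute differences between relative frequencies.
     * For two non-empty profiles: 0 means identical, 2 means no shared ngrams.
     */
    double distance(NgramProfile other) {
        double d = 0.0;
        for (Map.Entry<String, Double> e : freq.entrySet()) {
            d += Math.abs(e.getValue() - other.freq.getOrDefault(e.getKey(), 0.0));
        }
        for (Map.Entry<String, Double> e : other.freq.entrySet()) {
            if (!freq.containsKey(e.getKey())) {
                d += e.getValue(); // ngram present only in the other profile
            }
        }
        return d;
    }
}
```

An identifier built this way would hold one reference profile per language and report the language whose profile is closest to the profile of the input text.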




[jira] Commented: (TIKA-209) Language detection is weak.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711326#action_12711326 ]

Jukka Zitting commented on TIKA-209:
------------------------------------

I think something like that would be interesting for Tika. Would you like to contribute a patch?



[jira] Commented: (TIKA-209) Language detection is weak.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711333#action_12711333 ]

Robert Newson commented on TIKA-209:
------------------------------------

Sure thing. I adapted the Nutch code for my project, so it should be straightforward to do the same for Tika. It's all Apache licensed.

It'll be a few days; I'm swamped with other stuff.



[jira] Commented: (TIKA-209) Language detection is weak.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731621#action_12731621 ]

Ted Dunning commented on TIKA-209:
----------------------------------


I haven't looked at the Nutch code in forever, but my memory is that it didn't use the best statistics for the task. Here is an approach that seems to be more accurate:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.1958

Sadly, I don't have a Java implementation of this handy. I can give out an ancient C implementation.



[jira] Commented: (TIKA-209) Language detection is weak.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732130#action_12732130 ]

Jukka Zitting commented on TIKA-209:
------------------------------------

Anything would be fine. I'm sure we can find someone to port the code to Java.



[jira] Commented: (TIKA-209) Language detection is weak.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733063#action_12733063 ]

Jukka Zitting commented on TIKA-209:
------------------------------------

I took a look at the Nutch LanguageIdentifier code. It's indeed something we could use without too much effort.



[jira] Commented: (TIKA-209) Language detection is weak.

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733092#action_12733092 ]

Chris A. Mattmann commented on TIKA-209:
----------------------------------------

Hey Guys:

Awesome -- that was the intention for Tika from the beginning -- Jerome and I originally proposed this as a downstream feature and I think that the time has come.

Thanks,
Chris




[jira] Updated: (TIKA-209) Language detection is weak.

by JIRA jira@apache.org


     [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-209:
-----------------------------------

      Component/s: languageidentifier
    Fix Version/s: 0.5

- set component and fix version



[jira] Resolved: (TIKA-209) Language detection is weak.

by JIRA jira@apache.org


     [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-209.
--------------------------------

    Resolution: Fixed

I have refactored and simplified the language identifier code to better meet the needs of Tika. Most notably I fixed the ngram length to three characters to reduce the size of the language profile files and to make the ngram classes simpler.

AutoDetectParser now automatically attempts to detect the document language and sets the Metadata.LANGUAGE property if a reasonably certain language profile match is found.
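
One way to read "reasonably certain" is sketched below: given a distance between the document and each language profile (however it is computed), accept the closest language only when its distance falls under a cutoff. `LanguageChooser`, `choose`, and the cutoff value are hypothetical; this is not the code committed to Tika.

```java
import java.util.Map;
import java.util.Optional;

final class LanguageChooser {
    private LanguageChooser() {}

    /**
     * Returns the language with the smallest distance, or an empty Optional
     * when no language is within maxDistance (i.e. no confident match).
     */
    static Optional<String> choose(Map<String, Double> distanceByLanguage, double maxDistance) {
        return distanceByLanguage.entrySet().stream()
                .filter(e -> e.getValue() <= maxDistance)
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

Returning an empty result instead of the nearest language regardless of distance is what keeps a short input such as "hello" from being labeled with an arbitrary language.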



[jira] Assigned: (TIKA-209) Language detection is weak.

by JIRA jira@apache.org


     [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-209:
--------------------------------------

    Assignee: Jukka Zitting

- Jukka fixed this, so the assignment goes to him
