Language detection

6 Messages

Language detection

by Darren Cook

I need to write (PHP) code to detect the language of a given block of
text. (For my purposes I initially want to distinguish between English,
Japanese, German, Simplified Chinese, Traditional Chinese, Arabic,
Korean and French.) I want it to be reliable, so my plan is to keep a list
of Unicode code points found only in each given language [1], and use that
to return a high-confidence answer. If none are found, fall back to a list
of high-frequency words for each language [2] and use that to return a
lower-confidence answer.
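
The two-tier plan above could be sketched roughly as follows. This is only
an illustration under stated assumptions: the constant names, the function
name detect_language() and both tables are hypothetical, and the tables are
tiny samples rather than comprehensive lists. It also assumes the input is
valid UTF-8 and that only the target languages listed above occur (Arabic
script, for example, is shared with other languages).

```php
<?php
// Hypothetical two-tier detector. All names and tables here are
// illustrative assumptions, not part of any existing library.

// Tier 1: characters assumed to occur in only one of the target languages.
const UNIQUE_CHAR_PATTERNS = [
    'ja' => '/[\p{Hiragana}\p{Katakana}]/u', // kana
    'ko' => '/\p{Hangul}/u',
    'ar' => '/\p{Arabic}/u',
    'de' => '/ß/u',                          // scharfes S
];

// Tier 2: a handful of high-frequency words per language (ASCII only here).
const COMMON_WORDS = [
    'en' => ['the', 'be', 'to', 'of', 'and', 'is'],
    'de' => ['der', 'die', 'und', 'das', 'ist', 'nicht'],
    'fr' => ['le', 'la', 'les', 'et', 'est', 'des'],
];

/**
 * Returns ['lang' => string|null, 'confidence' => 'high'|'low'|'none'].
 */
function detect_language(string $text): array
{
    // Tier 1: any language-specific character gives a high-confidence answer.
    foreach (UNIQUE_CHAR_PATTERNS as $lang => $pattern) {
        if (preg_match($pattern, $text) === 1) {
            return ['lang' => $lang, 'confidence' => 'high'];
        }
    }

    // Tier 2: count tokens that appear in each common-word list.
    // strtolower() is enough because the word lists are plain ASCII.
    $words = preg_split('/[^\p{L}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        $words = []; // invalid UTF-8: treat as no usable words
    }

    $best = null;
    $bestHits = 0;
    foreach (COMMON_WORDS as $lang => $list) {
        $hits = count(array_intersect($words, $list));
        if ($hits > $bestHits) {
            $best = $lang;
            $bestHits = $hits;
        }
    }

    return $best === null
        ? ['lang' => null, 'confidence' => 'none']
        : ['lang' => $best, 'confidence' => 'low'];
}
```

For example, detect_language('Straße') would take the first branch because
of the ß, while 'Le chat est sur la table' contains none of the tier-1
characters and falls through to the word lists.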

Like most of my i18n-related PHP code, I'll release it as BSD-licensed open
source. But I wondered if there already exists something I could build
on. (Or comprehensive lists of Unicode code points only used in certain
languages; I have some small ad hoc lists, but the more I have the more
useful the algorithm is.)

(I'm aware of letter-frequency techniques,
http://en.wikipedia.org/wiki/Letter_frequencies but haven't worked out
where they would ever be more useful than word analysis.)

Darren

[1]: E.g. scharfes S (ß) for German, katakana/hiragana for Japanese (also
http://en.wiktionary.org/wiki/Category:Japanese-only_CJKV_Characters ).
Arabic and Korean also have unique alphabets. Accented letters for French.

[2]: E.g. for English "the", "be", "to", etc.
http://en.wikipedia.org/wiki/Most_common_words_in_English
Same list for German:
http://de.wikipedia.org/wiki/Liste_der_h%C3%A4ufigsten_W%C3%B6rter_der_deutschen_Sprache


--
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic
                        open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://darrendev.blogspot.com/ (blog on php, flash, i18n, linux, ...)

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


RE: Language detection

by Ed Batutis

> I need to write (PHP) code to detect the language of a given block of
> text.

Your proposed approach is very simplistic and probably won't be extensible,
if it works at all. The usual approach is statistical, based on groups of
characters (n-grams).

In any case, ICU has this. See

http://icu-project.org/userguide/charsetDetection.html

It has both charset and language detection. This is also available via Win32
and .NET APIs in case that helps at all.

If you roll your own you might want to be aware that there are a lot of
patents in this area.

=Ed





Re: Language detection

by Jan Schneider

Quoting Darren Cook <darren@...>:

> I need to write (PHP) code to detect the language of a given block of
> text. (For my purposes I initially want to distinguish between English,
> Japanese, German, Simplified Chinese, Traditional Chinese, Arabic,
> Korean and French.) I want it to be reliable, so my plan is to keep a list
> of Unicode code points found only in each given language [1], and use that
> to return a high-confidence answer. If none are found, fall back to a list
> of high-frequency words for each language [2] and use that to return a
> lower-confidence answer.

http://pear.php.net/package/Text_LanguageDetect

Jan.

--
Do you need professional PHP or Horde consulting?
http://horde.org/consulting/




Re: Language detection

by Darren Cook

>> I need to write (PHP) code to detect the language of a given block of
>> text.

> In any case, ICU has this. See
>
> http://icu-project.org/userguide/charsetDetection.html
>
> It has both charset and language detection.

Thanks Ed. Unless I've misunderstood, this is just doing charset
detection, with language as a bonus when the charset implies it? If
someone is actually using this and can confirm it can tell the
difference between say English, French and German, all in UTF-8
encoding, please let me know.

Thanks,

Darren






Re: Language detection

by Darren Cook

> http://pear.php.net/package/Text_LanguageDetect

Thanks to both people who suggested this; it is just what I was hoping
to find (and what Google didn't turn up). I'll start evaluating it, then
roll my own on top if it isn't accurate enough.

Darren




RE: Language detection

by Ed Batutis


> Thanks Ed. Unless I've misunderstood, this is just doing charset
> detection, with language as a bonus when the charset implies it?

That wouldn't be very useful. No, it uses recognizers for charset/language
combinations.

> difference between say English, French and German, all in UTF-8
> encoding, please let me know.

It does not have the data to do any UTF-8 language detection, but the
structure is in place.

You might want to consider adding data to their framework for what you want
to do. It isn't complicated. The most important thing you need is good
sample text in quantity so you can generate the n-gram probability table.
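
To make the idea of an n-gram probability table concrete, here is a rough
PHP sketch. It is not ICU's or Mozilla's code or data format: the function
names (ngram_counts, train, classify), the use of character trigrams, the
add-one smoothing and the one-sentence training samples are all assumptions
chosen for illustration. Real tables need sample text in quantity, as noted
above.

```php
<?php
// Hypothetical character n-gram classifier; names and data are illustrative.

/** Count overlapping character n-grams in UTF-8 text. */
function ngram_counts(string $text, int $n = 3): array
{
    // Split into code points; fall back to an empty list on invalid UTF-8.
    $chars = preg_split('//u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $counts = [];
    for ($i = 0; $i + $n <= count($chars); $i++) {
        $gram = implode('', array_slice($chars, $i, $n));
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }
    return $counts;
}

/** Build a table of log-probabilities with add-one smoothing. */
function train(string $sample, int $n = 3): array
{
    $counts = ngram_counts($sample, $n);
    $denominator = array_sum($counts) + count($counts) + 1; // +1 for unseen
    // The '__unseen__' key is longer than any trigram, so it cannot collide.
    $table = ['__unseen__' => log(1 / $denominator)];
    foreach ($counts as $gram => $count) {
        $table[$gram] = log(($count + 1) / $denominator);
    }
    return $table;
}

/** Return the language whose table gives the text the highest score. */
function classify(string $text, array $models, int $n = 3): ?string
{
    $grams = ngram_counts($text, $n);
    if ($grams === []) {
        return null; // text shorter than one n-gram
    }
    $best = null;
    $bestScore = -INF;
    foreach ($models as $lang => $table) {
        $score = 0.0;
        foreach ($grams as $gram => $count) {
            $score += $count * ($table[$gram] ?? $table['__unseen__']);
        }
        if ($score > $bestScore) {
            $best = $lang;
            $bestScore = $score;
        }
    }
    return $best;
}

// Toy training data: one sentence per language, far too little for real use.
$models = [
    'en' => train('the quick brown fox jumps over the lazy dog and the cat is in the house with the other animals'),
    'de' => train('der schnelle braune fuchs springt über den faulen hund und die katze ist im haus mit den anderen tieren'),
    'fr' => train('le renard brun rapide saute par-dessus le chien paresseux et le chat est dans la maison avec les autres animaux'),
];

echo classify('die katze ist im haus', $models), "\n";
```

Because every language here is scored from UTF-8 text, this kind of table
does not depend on the charset to separate English, French and German.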

I believe the code was taken from Mozilla, so you might look there. Maybe
they've already done what you are looking for.

=Ed


