Language detection

I need to write (PHP) code to detect the language of a given block of text. (For my purposes I want to initially distinguish between English, Japanese, German, Simplified Mandarin, Traditional Mandarin, Arabic, Korean, and French.)

I want it to be reliable, so my plan was to have a list of Unicode points only found in each given language [1], and use that to return a high-confidence answer. If none are found, then have a list of high-frequency words for each language [2] and use that to return a lower-confidence answer.

Like most of my i18n-related PHP code, I'll release it as BSD-licensed open source. But I wondered if there already existed something I could build on. (Or comprehensive lists of Unicode points only used in certain languages; I have some small ad hoc lists, but the more I have, the more useful the algorithm is.)

(I'm aware of letter-frequency techniques, http://en.wikipedia.org/wiki/Letter_frequencies, but haven't worked out where they are ever more useful than word analysis.)

Darren

[1]: E.g. scharfes S for German, katakana/hiragana for Japanese (also http://en.wiktionary.org/wiki/Category:Japanese-only_CJKV_Characters). Arabic and Korean also have unique alphabets. Accents for French.

[2]: E.g. for English "the", "be", "to", etc.: http://en.wikipedia.org/wiki/Most_common_words_in_English
Same list for German: http://de.wikipedia.org/wiki/Liste_der_h%C3%A4ufigsten_W%C3%B6rter_der_deutschen_Sprache

--
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://darrendev.blogspot.com/ (blog on php, flash, i18n, linux, ...)

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
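For illustration, the two-tier plan described in this message might be sketched roughly as follows. This is not Darren's code: the function names, the script table, and the tiny stopword lists are all assumptions made up for the example, and it relies on PCRE Unicode script properties (the `/u` flag) plus the mbstring extension.

```php
<?php
// Sketch only: the script table and stopword lists below are illustrative
// assumptions, not vetted data. Save this file as UTF-8.

/** Tier 1: look for characters whose script (or letter) points at one language. */
function detectByScript(string $text): ?string
{
    // Order matters: kana is tested before Han because Japanese text
    // normally mixes kana with Han characters.
    $patterns = [
        'ja' => '/[\p{Hiragana}\p{Katakana}]/u',
        'ko' => '/\p{Hangul}/u',
        'ar' => '/\p{Arabic}/u',
        'zh' => '/\p{Han}/u',   // cannot separate Simplified from Traditional
        'de' => '/ß/u',
    ];
    foreach ($patterns as $lang => $pattern) {
        if (preg_match($pattern, $text) === 1) {
            return $lang;
        }
    }
    return null;
}

/** Tier 2: count hits against short high-frequency word lists. */
function detectByStopwords(string $text): ?string
{
    $stopwords = [
        'en' => ['the', 'be', 'to', 'of', 'and'],
        'de' => ['der', 'die', 'und', 'das', 'nicht'],
        'fr' => ['le', 'la', 'les', 'et', 'est'],
    ];
    // Split on runs of non-letters; preg_split returns false on invalid UTF-8.
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $best = null;
    $bestScore = 0;
    foreach ($stopwords as $lang => $list) {
        // array_intersect keeps duplicates from $words, so repeats count.
        $score = count(array_intersect($words, $list));
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $lang;
        }
    }
    return $best;
}

/** Combine both tiers and report which one produced the answer. */
function detectLanguage(string $text): array
{
    $lang = detectByScript($text);
    if ($lang !== null) {
        return ['language' => $lang, 'confidence' => 'high'];
    }
    $lang = detectByStopwords($text);
    if ($lang !== null) {
        return ['language' => $lang, 'confidence' => 'low'];
    }
    return ['language' => null, 'confidence' => 'none'];
}
```

Calling `detectLanguage('The cat wants to be on the mat')` falls through to the word lists, while text containing kana or hangul is answered by the first tier. The "high" label only holds within the eight target languages: Arabic script is also used by other languages, and Han-only text could be Chinese of either variety or Japanese.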
RE: Language detection

> I need to write (PHP) code to detect the language of a given block of
> text.

Your proposed approach is very simplistic and probably won't be extensible, if it works at all. Usually a statistical approach is taken, using groups of characters.

In any case, ICU has this. See

http://icu-project.org/userguide/charsetDetection.html

It has both charset and language detection. This is also available via Win32 and .NET APIs, in case that helps at all.

If you roll your own, you might want to be aware that there are a lot of patents in this area.

=Ed
Re: Language detection

Quoting Darren Cook <darren@...>:

> I need to write (PHP) code to detect the language of a given block of
> text. (For my purposes I want to initially distinguish between English,
> Japanese, German, Simplified Mandarin, Traditional Mandarin, Arabic,
> Korean, French) I want it to be reliable so my plan was to have a list
> of Unicode points only found in each given language [1], and use that to
> return a high confidence answer. If none found, then have a list of high
> frequency words for each language [2] and use that to return a lower
> confidence answer.

http://pear.php.net/package/Text_LanguageDetect

Jan.

--
Do you need professional PHP or Horde consulting?
http://horde.org/consulting/
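A rough idea of what calling the package Jan links to could look like is shown below. It assumes Text_LanguageDetect is installed and on the include path; the wrapper function `detectWithPear` is a made-up name for this example, and error handling differs between releases, so treat the details as a sketch and check them against the package documentation.

```php
<?php
// Sketch only: assumes the PEAR package Text_LanguageDetect is installed
// and reachable on the include path. detectWithPear() is a hypothetical helper.

/**
 * Returns up to $limit candidate languages (name => similarity score),
 * or null when the package is missing or detection fails.
 */
function detectWithPear(string $text, int $limit = 3): ?array
{
    if (stream_resolve_include_path('Text/LanguageDetect.php') === false) {
        return null; // package not installed
    }
    require_once 'Text/LanguageDetect.php';

    try {
        $detector = new Text_LanguageDetect();
        $result = $detector->detect($text, $limit);
    } catch (Exception $e) {
        return null; // newer releases report failures with exceptions
    }

    // Older releases may hand back a PEAR_Error object instead of an array.
    return is_array($result) ? $result : null;
}

$candidates = detectWithPear('Dies ist ein kurzer deutscher Satz.');
if ($candidates === null) {
    echo "Text_LanguageDetect is not available\n";
} else {
    print_r($candidates);
}
```

Which languages the package can recognise depends on its bundled data, so it is worth checking the supported list against the eight languages in the original question before relying on it.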
Re: Language detection

>> I need to write (PHP) code to detect the language of a given block of
>> text.
> In any case, ICU has this. See
>
> http://icu-project.org/userguide/charsetDetection.html
>
> It has both charset and language detection.

Thanks Ed. Unless I've misunderstood, this is just doing charset detection, with language as a bonus when the charset implies it?

If someone is actually using this and can confirm it can tell the difference between, say, English, French and German, all in UTF-8 encoding, please let me know.

Thanks,
Darren
Re: Language detection

> http://pear.php.net/package/Text_LanguageDetect

Thanks to both people who suggested this; it is just what I was hoping to find (and that Google didn't). I'll start evaluating it, then roll my own on top if it isn't accurate enough.

Darren
RE: Language detection

> Thanks Ed. Unless I've misunderstood, this is just doing charset
> detection, with language as a bonus when the charset implies it?

That wouldn't be very useful. No, it uses recognizers for charset/language combinations.

> difference between say English, French and German, all in UTF-8
> encoding, please let me know.

It does not have data to do any UTF-8 language detection, but the structure is in place. You might want to consider adding data to their framework for what you want to do. It isn't complicated. The most important thing you need is good sample text in quantity so you can generate the n-gram probability table.

I believe the code was taken from Mozilla, so you might look there. Maybe they've already done what you are looking for.

=Ed
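To make the "n-gram probability table" idea concrete, here is a minimal sketch. It is not ICU's or Mozilla's implementation: it is a naive character-trigram model that scores text by summed log-probabilities, with a fixed penalty for unseen trigrams. The function names, the penalty value, and the one-sentence training samples are all assumptions for illustration; real use needs large sample texts, as Ed says. It requires the mbstring extension.

```php
<?php
// Sketch only: a naive character-trigram scorer. Names, the penalty value
// and the one-sentence samples are illustrative assumptions.

/** Count character trigrams in lowercased, space-padded text. */
function trigramCounts(string $text): array
{
    $text = ' ' . mb_strtolower(trim($text), 'UTF-8') . ' ';
    // Split into Unicode characters; preg_split returns false on invalid UTF-8.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $counts = [];
    for ($i = 0, $n = count($chars) - 2; $i < $n; $i++) {
        $gram = $chars[$i] . $chars[$i + 1] . $chars[$i + 2];
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }
    return $counts;
}

/** Build the probability table: trigram => log(relative frequency). */
function trainModel(string $sample): array
{
    $counts = trigramCounts($sample);
    $total = array_sum($counts);
    $model = [];
    foreach ($counts as $gram => $count) {
        $model[$gram] = log($count / $total);
    }
    return $model;
}

/** Sum log-probabilities; unseen trigrams get a fixed penalty. */
function scoreText(string $text, array $model, float $unseen = -12.0): float
{
    $score = 0.0;
    foreach (trigramCounts($text) as $gram => $count) {
        $score += $count * ($model[$gram] ?? $unseen);
    }
    return $score;
}

/** Return the language whose model gives the highest score. */
function classify(string $text, array $models): ?string
{
    $best = null;
    $bestScore = -INF;
    foreach ($models as $lang => $model) {
        $score = scoreText($text, $model);
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $lang;
        }
    }
    return $best;
}

// Toy training data; far too small for real use.
$models = [
    'en' => trainModel('the quick brown fox jumps over the lazy dog and the cat is in the house'),
    'de' => trainModel('der schnelle braune fuchs springt über den faulen hund und die katze ist im haus'),
    'fr' => trainModel('le renard brun rapide saute par-dessus le chien paresseux et le chat est dans la maison'),
];

echo classify('the dog is in the house', $models), "\n";
```

Unlike the unique-character approach, this can separate languages that share the Latin alphabet, which is the English/French/German case raised earlier in the thread. The fixed penalty only makes sense while every seen trigram has a log-probability above it, so a larger training corpus would call for proper smoothing instead.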