[H-GEN] programmatically detecting language of UTF-8 text

Clinton Roy clinton.roy at gmail.com
Mon Mar 30 22:01:23 EDT 2009


Hi Mick,

I'm not a Unicode guru, but I do know the basics.

On Tue, Mar 31, 2009 at 9:50 AM, mick <mickhowe at bigpond.net.au> wrote:

> I'm working on a yahoo chat client and wish to filter out text in selected
> languages.

That seems odd on quite a few levels, but oh well :)

> several days of searching and reading many documents has revealed much hype
> and propaganda and some info on coding into UTF-8 but nothing on analysis of
> strings for info like what language a string is in.

UTF-8 is just an encoding of Unicode. Unicode does not deal in
languages, only in written symbols, which it groups into scripts.
Every code point (roughly, a written symbol) is assigned exactly one
script[1]. The Unicode Character Database includes a file that maps
code points to scripts.

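So you could decode your UTF-8 string, map each code point to its
script, and then apply your own ideological mapping from script to
language. Bear in mind that one script can serve many languages
(Latin covers English, French, Vietnamese and plenty more), so this
narrows things down rather than identifying a language outright.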
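
As a rough sketch of that pipeline in Python: the range table below is
hand-picked and approximate (a real one would come from the Unicode
data), and the script-to-language table and all function names are made
up purely for illustration.

```python
from collections import Counter

# Approximate, hand-picked code point ranges (an assumption for this
# sketch; the authoritative data lives in the Unicode Scripts.txt file).
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x00C0, 0x00D6, "Latin"),
    (0x00D8, 0x00F6, "Latin"),
    (0x00F8, 0x024F, "Latin"),
    (0x0391, 0x03A1, "Greek"),
    (0x03A3, 0x03E1, "Greek"),
    (0x0400, 0x0481, "Cyrillic"),
    (0x05D0, 0x05EA, "Hebrew"),
    (0x0621, 0x063F, "Arabic"),
    (0x0641, 0x064A, "Arabic"),
    (0x3041, 0x3096, "Hiragana"),
    (0x30A1, 0x30FA, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
    (0xAC00, 0xD7A3, "Hangul"),
]

# Hypothetical "ideological" mapping from script to candidate languages.
SCRIPT_TO_LANGUAGES = {
    "Latin": ["English", "French", "German"],
    "Cyrillic": ["Russian", "Ukrainian"],
    "Greek": ["Greek"],
    "Han": ["Chinese", "Japanese"],
}


def script_of(ch):
    """Return the script name for one character, or "Unknown"."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "Unknown"


def script_counts(text):
    """Count scripts over the letters in text (digits, punctuation skipped)."""
    return Counter(script_of(ch) for ch in text if ch.isalpha())


def dominant_script(text):
    """Return the most frequent script among the letters, or None."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None


def should_filter(raw_bytes, blocked_scripts):
    """Decode UTF-8 bytes and report whether the dominant script is blocked."""
    text = raw_bytes.decode("utf-8", errors="replace")
    return dominant_script(text) in blocked_scripts
```

Calling should_filter(message_bytes, {"Cyrillic"}) would then flag
messages written mostly in Cyrillic letters, whichever language they
happen to be in.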

On my Debian system, the /usr/share/unicode/Scripts.txt database is
part of the unicode-data package.
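
Something like the following could read that file into a lookup table.
It is only a sketch: it assumes the usual "first..last ; Script #
comment" line layout, the sample lines are a small excerpt in that
layout, and the function names are my own invention.

```python
import bisect
import re

# Matches "0041..005A ; Latin" or "00AA ; Latin"; comment lines start
# with "#" and therefore never match.
LINE_RE = re.compile(r"^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)")

SAMPLE = """\
# Excerpt in the Scripts.txt line layout
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0400..0481    ; Cyrillic # L& [130] CYRILLIC CAPITAL LETTER IE WITH GRAVE..CYRILLIC SMALL LETTER KOPPA
"""


def parse_scripts(lines):
    """Turn Scripts.txt-style lines into a sorted list of (first, last, script)."""
    ranges = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # blank line or comment
        first = int(m.group(1), 16)
        last = int(m.group(2), 16) if m.group(2) else first
        ranges.append((first, last, m.group(3)))
    ranges.sort()
    return ranges


def load_scripts(path="/usr/share/unicode/Scripts.txt"):
    """Parse the database from disk (path as on my Debian system)."""
    with open(path, encoding="utf-8") as f:
        return parse_scripts(f)


def lookup(ranges, ch):
    """Return the script of one character, or None if no range covers it."""
    cp = ord(ch)
    # Find the last range whose first code point is <= cp.
    i = bisect.bisect_right(ranges, (cp, 0x110000, "")) - 1
    if i >= 0 and ranges[i][1] >= cp:
        return ranges[i][2]
    return None


ranges = parse_scripts(SAMPLE.splitlines())
print(lookup(ranges, "Q"), lookup(ranges, "ж"))
```

The binary search keeps each lookup cheap even with the full file, which
has a couple of thousand ranges.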

Hope that helps,

[1] Well, that used to be the case anyway.
-- 
Clinton Roy
CSIRO - Robotics Platform Engineer
Autonomous Systems Lab

humbug.org.au  - Brisbane Unix Group
azure.humbug.org.au/~croy/blog - Blog
flickr.com/photos/croy/ - Photos



