[H-GEN] programmatically detecting language of UTF-8 text

mick mickhowe at bigpond.net.au
Tue Mar 31 20:02:53 EDT 2009


On Tue, 31 Mar 2009 12:01:23 Clinton Roy wrote:
> Hi Mick,
>
> I'm not a Unicode guru, but I do know the basics.
>
> On Tue, Mar 31, 2009 at 9:50 AM, mick <mickhowe at bigpond.net.au> wrote:
> > I'm working on a yahoo chat client and wish to filter out text in
> > selected languages.
>
> That seems odd on quite a few levels, but oh well :)
No dispute, Yahoo really is a waste of space, especially with Egyptian boys
flooding the rooms with demands for women to get naked for them on webcam,
written in Arabic script in poster-sized text. (I have dropped some of their
posts into a translator.)

I can't read Chinese, Japanese, etc. characters, so there is no point in my
seeing them. It would suit me and many of the people I chat with to be able
to simply filter these out.
>
> > several days of searching and reading many documents has revealed much
> > hype and propaganda and some info on coding into UTF-8 but nothing on
> > analysis of strings for info like what language a string is in.
>
> UTF-8 is just an encoding of Unicode. Unicode does not deal in
> languages, just written symbols; Unicode refers to groups of symbols
> as scripts. Every codepoint (written symbol) belongs to exactly one
> script[1]. There is a Unicode database that maps from code points to
> scripts.
>
> So, you could read your utf-8 string in, map from the code point to a
> script, then have your own ideological mapping from script to
> language.
"Language" is probably the wrong word in this context; maybe "national
character set" is more suitable.
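
For what it's worth, the approach described above (decode the UTF-8, map each
code point to a script, then decide per script) might be sketched roughly like
this in Python. Note the assumptions: Python's standard unicodedata module does
not expose the Unicode Script property, so this uses the first word of each
character's Unicode name ("ARABIC", "CJK", "HIRAGANA", ...) as an approximate
stand-in. The authoritative mapping is Scripts.txt in the Unicode Character
Database. The names BLOCKED, rough_script, scripts_in and should_hide, and the
0.5 threshold, are made up for illustration.

```python
import unicodedata

# Hypothetical set of script labels to filter out; adjust to taste.
BLOCKED = {"ARABIC", "CJK", "HIRAGANA", "KATAKANA", "HANGUL"}

def rough_script(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Code point has no name (e.g. control characters).
        return "UNKNOWN"

def scripts_in(raw):
    """Decode UTF-8 bytes and count letters per approximate script."""
    text = raw.decode("utf-8", errors="replace")
    counts = {}
    for ch in text:
        if ch.isalpha():  # ignore digits, punctuation, spaces, symbols
            label = rough_script(ch)
            counts[label] = counts.get(label, 0) + 1
    return counts

def should_hide(raw, blocked=BLOCKED, threshold=0.5):
    """True if at least `threshold` of the letters are in a blocked script."""
    counts = scripts_in(raw)
    total = sum(counts.values())
    if total == 0:
        return False
    hits = sum(n for label, n in counts.items() if label in blocked)
    return hits / total >= threshold
```

Something like should_hide(message_bytes) could then be called on each incoming
chat line. Using a fraction rather than "any blocked character" is a design
choice, so that a mostly-Latin message containing one stray character is still
shown.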

Thanks
/]/]ik