[H-GEN] UTF-8 vs ISO-8859-1
Benjamin Carlyle
benjamincarlyle at optusnet.com.au
Sun Jun 25 01:06:06 EDT 2006
On Fri, 2006-06-23 at 12:13 +0200, Bruce Campbell wrote:
> In answer to the original question, the only reasons that you, in
> Australia, would need to go outside US-ASCII is to get foreign currency
> symbols (the Euro is in ISO-8859-15, not ISO-8859-1 btw), or you are
> trying to type the proper names of some of the players Australia has come
> up against recently.
What? No wingdings[1]?
Unicode is the way most software is heading. It has its pluses and
minuses. The main plus is that just about every country in the world
seems pretty much able to agree on it. We can paint an ideal world where
there is only one character set, and everyone can agree what is meant by
each code point they process.
Of course, life is not quite so simple. Unicode code points now run up
to U+10FFFF, which takes 21 bits to represent. Obviously, you can't
stuff that into a byte. C-era software is used to dealing with
characters as bytes, so we need a useful encoding into bytes. UTF-8[2]
is that encoding. It just isn't the only encoding.
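To make that concrete, here is a quick sketch in modern Python (purely
illustrative; any language with proper Unicode strings would do)
showing how code points of different sizes come out as UTF-8 bytes:

    # Encode code points of various sizes into UTF-8 byte sequences.
    for ch in ("A", "\u00e9", "\u20ac", "\U0001d11e"):  # A, e-acute, Euro, G clef
        print("U+%06X -> %s" % (ord(ch), ch.encode("utf-8").hex()))
    # U+000041 -> 41        (1 byte: plain ASCII survives unchanged)
    # U+0000E9 -> c3a9      (2 bytes)
    # U+0020AC -> e282ac    (3 bytes)
    # U+01D11E -> f09d849e  (4 bytes: this one is above the 16-bit range)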
UCS-2[3] was used by early Windows and early Java. This encoding was
nice and simple: two bytes per character. Easy. Unfortunately, a
character isn't treated as a pair of bytes; it gets treated as a 16-bit
integer, so how you represent it depends on the endianness of your
architecture. That was also in the day when they knew eight bits
wouldn't be enough, but believed that surely sixteen would be. That
belief turned out to be misguided. Around the year 2000 vendors were
madly swapping UCS-2 for UTF-16. That provides for the representation
of higher code points (as so-called surrogate pairs), but does so as a
sequence of two-byte... well... whatevers. It lands in the
worst-of-both-worlds category by being an architecture-dependent
representation that still doesn't represent the whole code point in a
fixed width.
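Both problems are easy to see from code. Another Python sketch (the
utf-16-le and utf-16-be codec names are the standard library's; the
characters are just examples):

    # One code point, two byte orders; and anything above U+FFFF becomes
    # a surrogate pair of two 16-bit units.
    clef = "\U0001d11e"                        # U+1D11E, outside the 16-bit range
    print(clef.encode("utf-16-be").hex())      # d834dd1e (big-endian surrogate pair)
    print(clef.encode("utf-16-le").hex())      # 34d81edd (same pair, bytes swapped)
    print("\u20ac".encode("utf-16-be").hex())  # 20ac (the Euro fits in one unit)

The byte-order mark (BOM) exists precisely so that a reader can work
out which of the two orderings a UTF-16 stream is using.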
These days UTF-8 and UCS-4[4] are probably the smart encodings;
however, the influence of China requires another choice to be
considered. GB18030[5] provides a mapping for all Unicode code points,
but encodes them based on the earlier Chinese standard GB2312 and its
GBK extension. UTF-8 can be parsed as plain ASCII whenever it contains
only ASCII characters, and it never inserts a byte that looks like
7-bit ASCII when it isn't really that 7-bit ASCII character. GB18030
has the same properties with respect to GBK.
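The ASCII side of that claim is easy to check (Python again; its
standard library happens to ship a gb18030 codec):

    text = "hello"
    print(text.encode("utf-8") == text.encode("ascii"))    # True: pure ASCII is unchanged
    print(text.encode("gb18030") == text.encode("ascii"))  # True: GB18030 pulls the same trick
    # Multi-byte UTF-8 sequences never contain bytes below 0x80:
    print(all(b >= 0x80 for b in "\u20ac\u6f22".encode("utf-8")))  # True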
GB18030 is not exactly a Unicode encoding. Anything that overlaps
between Unicode and GBK keeps its old GBK code in the one- and two-byte
part of the one- to four-byte sequence space. The rest of Unicode is
sort of pasted on the end as four-byte sequences. All software sold in
China is expected to run in this character encoding. Perhaps there are
exceptions, but the general gist is that this is required by law.
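You can watch that split happen by looking at encoded lengths (same
Python codec as above; the characters chosen are just examples):

    for ch in ("\u6c49", "\u20ac", "\U0001d11e"):  # a han character, Euro, G clef
        print("U+%05X -> %d bytes" % (ord(ch), len(ch.encode("gb18030"))))
    # U+06C49 -> 2 bytes  (in GBK, so it keeps its old two-byte code)
    # U+020AC -> 2 bytes  (the Euro was granted a two-byte slot)
    # U+1D11E -> 4 bytes  (never in GBK; lands in the pasted-on-the-end region)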
Now, you might have thought this was the end of things. We know about
the different kinds of encodings, so we can extract a Unicode code
point from any of them. There are still two more variables:
1. Some Unicode code points need to be collapsed together in order to
render a single glyph to the screen. Does this make it one character, or
two? Just don't go there :) (there's a sketch of this just after the
list)
2. The same Unicode code point may be rendered differently depending on
the language of the text it is written in. The character for bone (骨)
in Chinese and Japanese has to have a particular hook go to one side or
the other in order to make sense. Still, it is one character despite
looking different in the two languages. Yippee.
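Point 1, at least, you can poke at with Python's standard unicodedata
module; "é" exists both as one precomposed code point and as an "e"
plus a combining accent:

    import unicodedata

    composed   = "\u00e9"    # e-acute as a single code point
    decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
    print(composed == decomposed)          # False: different code point sequences
    print(len(composed), len(decomposed))  # 1 2 -- yet one glyph on screen either way
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True once normalised

Unicode normalisation (NFC, NFD and friends) exists to paper over
exactly this ambiguity.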
More reading:
http://czyborra.com/unicode/characters.html
What role does ISO-8859-1 play in all of this? Well, it's a compact
encoding for some Latin characters. For internationalised software it is
the same kind of nuisance as all of those other character encodings that
float about the world.
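Compact is about all it has going for it; one last Python comparison:

    text = "d\u00e9j\u00e0 vu"             # "déjà vu"
    print(len(text.encode("iso-8859-1")))  # 7 bytes: one per character
    print(len(text.encode("utf-8")))       # 9 bytes: the accented letters take two each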
Benjamin
[1] http://www.alanwood.net/demos/wingdings.html
[2] http://en.wikipedia.org/wiki/UTF-8
[3] http://en.wikipedia.org/wiki/UTF-16/UCS-2
[4] http://en.wikipedia.org/wiki/UTF-32/UCS-4
[5] http://en.wikipedia.org/wiki/GB_18030