[H-GEN] ASCII Characters
Benjamin
benjamincarlyle at optusnet.com.au
Fri Oct 24 22:56:55 EDT 2003
[ Humbug *General* list - semi-serious discussions about Humbug and ]
[ Unix-related topics. Posts from non-subscribed addresses will vanish. ]
Joe Skilton wrote:
> From: "Andrew Pullin" <andrew at hotspurbgc.com.au>
>> What characters do you want to keep?
> Unfortunatly I want to keep the 'valid' (non-ascii) characters, and throw
> out the rest.
Hello.
Just to clarify: It's all ASCII. Or ISO-8859-1, or UTF-8, or... well
there are many character encodings and ASCII is just one. As the
man-page says "Many 8-bit codes contain ASCII as their lower half", so
you can often do filtering designed for ASCII on other character encodings.
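
As a rough sketch of that kind of filtering in Python, assuming the unwanted characters are the ASCII control codes (the function name and the choice to keep tab, newline and carriage return are my own assumptions, not something from the thread):

```python
def strip_control_bytes(data: bytes) -> bytes:
    """Drop ASCII control codes, keeping tab, LF, CR and all other bytes."""
    keep_controls = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return
    return bytes(
        b for b in data
        # 0x00-0x1F and 0x7F are the ASCII control codes; bytes >= 0x80
        # belong to the "upper half" and are passed through untouched.
        if b in keep_controls or (b >= 0x20 and b != 0x7F)
    )


sample = "caf\u00e9\x07 ok\x00\n"
print(strip_control_bytes(sample.encode("utf-8")))
print(strip_control_bytes(sample.encode("iso-8859-1")))
```

Because it only ever looks at values below 128, the same function works on ISO-8859-1 and UTF-8 input alike. It makes no attempt to judge whether the bytes in the upper half are valid for either encoding.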
Where is all this character-encoding business coming from? Well,
naturally computers don't work in letters. They work in numbers. Your
average processor doesn't know what an 'a' is; it has to see a number
that represents 'a'.
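
You can see the number behind a letter for yourself. A small Python illustration (Python reports Unicode code points, which are the same as the ASCII values for these characters):

```python
code = ord("a")     # the number that stands for 'a'
letter = chr(code)  # and back from the number to the letter
print(code, letter)  # prints: 97 a
```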
An obvious encoding is a = 1, b = 2, and so on until z = 26... but hang
on. What if I want to encode capital letters differently to lower-case
letters? Maybe you add A = 27, B = 28, and so on until Z = 52. But hang
on. What about spaces and punctuation?
This is how ASCII came about. Every computer system around seemed to
have a different way of encoding these very basic character concepts, so
an encoding was standardised for the purpose of communicating between
machines. Each machine could use its own encoding internally but would
send and receive data in the ASCII format (that's why it's the American
Standard Code for -Information Interchange-). ASCII contains the English
alphabet, the digits, a whole bunch of punctuation, and some control
codes (the stuff you've been trying to filter out).
Anyway, time marched on and soon most computers were using ASCII
internally as well. It saves on re-inventing the wheel, and saves on
conversion, too. The trouble was that ASCII has significant
shortcomings: there was no way to encode many characters from other
languages. Standards like ISO-8859-1 and Unicode came about to allow
other languages to be used consistently for interchanging information
between computer systems. ISO-8859-1 covers (as I recall) most of your
Western European languages. Obviously Japan, China, and the like felt
very much left out of this process and eventually Unicode came along to
cover all bases.
By this time ASCII was so pervasive that there was no point fighting it.
ISO-8859-1 and UTF-8 (the usual 8-bit encoding of Unicode) both use ASCII
unchanged for their first 128 values, 0 to 127, which is everything that
fits in 7 bits.
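
A quick way to check this overlap, sketched in Python with its standard codec names: plain ASCII text comes out as the same bytes in all three encodings, and the encodings only part ways once you step outside ASCII.

```python
text = "Hello, world!"

ascii_bytes = text.encode("ascii")
latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

# Pure ASCII text is byte-for-byte identical in all three encodings.
same = ascii_bytes == latin1_bytes == utf8_bytes
print(same)

# Outside ASCII they differ: e-acute is one byte in ISO-8859-1
# and two bytes in UTF-8, all of them with values of 128 or above.
e_latin1 = "\u00e9".encode("iso-8859-1")
e_utf8 = "\u00e9".encode("utf-8")
print(list(e_latin1), list(e_utf8))
```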
Benjamin
--
Any outright lies or untruths in the preceding summary are based on the
fact that I did not research a letter of it... oh, except for looking
up the ascii(7) man page ;)
--
* This is list (humbug) general handled by majordomo at lists.humbug.org.au .
* Postings to this list are only accepted from subscribed addresses of
* lists 'general' or 'general-post'. See http://www.humbug.org.au/