[H-GEN] ASCII Characters

Fri Oct 24 22:56:55 EDT 2003

[ Humbug *General* list - semi-serious discussions about Humbug and     ]
[ Unix-related topics. Posts from non-subscribed addresses will vanish. ]

Joe Skilton wrote:
> From: "Andrew Pullin" <andrew at hotspurbgc.com.au>
>>    What characters do you want to keep?
> Unfortunately I want to keep the 'valid' (non-ascii) characters, and throw
> out the rest.

Hello.

Just to clarify: It's all ASCII. Or ISO-8859-1, or UTF-8, or... well 
there are many character encodings and ASCII is just one. As the 
man-page says "Many 8-bit codes contain ASCII as their lower half", so 
you can often apply filtering designed for ASCII to other character
encodings.

Where is all this character-encoding business coming from? Well, 
naturally computers don't work in letters. They work in numbers. Your 
average processor doesn't know what an 'a' is; it has to see a number 
that represents 'a'.
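
To see those numbers for yourself, here is a minimal sketch in Python (my
choice of language; nothing in this thread depends on it). The helper name
char_codes is made up for illustration; ord() and chr() are Python
built-ins.

```python
# Each character is stored as a number: ord() shows it, chr() reverses it.
def char_codes(text):
    """Return the numeric code of each character in text."""
    return [ord(ch) for ch in text]

print(char_codes("a"))     # [97]
print(char_codes("Az "))   # [65, 122, 32]
print(chr(97))             # a
```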

An obvious encoding is a = 1, b = 2, and so on until z = 26... but hang 
on. What if I want to encode capital letters differently to lower-case 
letters? Maybe you add A = 27, B = 28, and so on until Z = 52. But hang 
on. What about spaces and punctuation?

This is how ASCII came about. Every computer system around seemed to 
have a different way of encoding these very basic character concepts, so 
a common encoding was standardised for the purpose of communicating 
between machines. Each machine could use its own encoding internally but 
would send and receive data in the ASCII format (that's why it's the 
American Standard Code for -Information Interchange-). ASCII contains 
the English alphabet, a whole bunch of punctuation, etc, and some control 
codes (the stuff you've been trying to filter out).
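
As a rough idea of what filtering out those control codes could look
like, here is a sketch in Python. It is only one possible approach, not
necessarily what anyone in this thread is using: the function name
strip_control_codes and the choice to keep tab, line feed and carriage
return are my own assumptions. It works on raw bytes and leaves anything
above 127 alone, so ISO-8859-1 or UTF-8 text should pass through
undamaged.

```python
# Sketch: drop ASCII control codes (0-31 and 127) from a byte string.
# Keeping tab, line feed and carriage return is an assumption; adjust
# KEEP to taste.
KEEP = {0x09, 0x0A, 0x0D}

def strip_control_codes(data: bytes) -> bytes:
    """Return data without ASCII control bytes; bytes >= 128 are untouched."""
    return bytes(b for b in data if b in KEEP or (b >= 0x20 and b != 0x7F))

raw = b"Hello\x07 world\x00!\n"
print(strip_control_codes(raw))   # b'Hello world!\n'
```

Note that ISO-8859-1 also reserves 128-159 for further control codes;
this sketch does not touch those.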

Anyway, time marches on and soon most computers were using ASCII 
internally as well. It saves on re-inventing the wheel, and saves on 
conversion, too. The trouble was that ASCII has significant 
shortcomings. There was no way to encode many characters from other 
languages. Standards like ISO-8859-1 and Unicode came about to allow 
other languages to be used consistently for interchanging information 
between computer systems. ISO-8859-1 covers (as I recall) most of your 
Western European languages. Obviously Japan, China, and the like felt 
very much left out of this process and eventually Unicode came along to 
cover all bases.

By this time ASCII was so pervasive that there was no point fighting it. 
ISO-8859-1 and UTF-8 (the usual 8-bit encoding of Unicode) both use 
ASCII for their first 128 values, the ones that fit in 7 bits.
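
A quick way to check that overlap, sketched in Python. The helper name
encodings is made up for illustration; str.encode() and the codec names
"iso-8859-1" and "utf-8" are standard Python. Pure ASCII text comes out
byte-for-byte identical in both encodings; only characters outside ASCII
differ.

```python
def encodings(text):
    """Return the bytes of text in ISO-8859-1 and in UTF-8."""
    return text.encode("iso-8859-1"), text.encode("utf-8")

# Pure ASCII: both encodings give exactly the ASCII bytes.
latin1, utf8 = encodings("cafe")
print(latin1 == utf8 == "cafe".encode("ascii"))   # True

# One non-ASCII character (e with an acute accent): the encodings diverge.
latin1, utf8 = encodings("caf\u00e9")
print(latin1)   # b'caf\xe9'
print(utf8)     # b'caf\xc3\xa9'
```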

Benjamin
--
Any outright lies or untruths in the preceding summary are based on the 
fact that I did not research a letter of it... oh, except for looking 
up the ascii(7) man page ;)

--
* This is list (humbug) general handled by majordomo at lists.humbug.org.au .
* Postings to this list are only accepted from subscribed addresses of
* lists 'general' or 'general-post'.  See http://www.humbug.org.au/