by Matt Neuburg
<matt@tidbits.com>
If you're using Mac OS X, a massive revolution is proceeding unnoticed on your computer. No, I don't mean Unix, preemptive multitasking, or any other familiar buzzwords. I'm talking about text.
How can text be revolutionary? Text is not sexy. We take text for granted, typing it, reading it, editing it, storing it. Text is one of the main reasons most people bought computers in the first place. It's a means, a medium; it's not an end, not something explicit. The keyboard lies under our hands; strike a key and the corresponding letter appears. What could be simpler?
But the more you know about text and how it works on a computer, the more amazing it is that you can do any typing at all. There are issues of what keyboard you're using, how the physical keys map to virtual keycodes, how the virtual keycodes are represented as characters, how to draw the characters on the screen, and how to store information about them in files. There are problems of languages, fonts, uppercase and lowercase, diacritics, sort order, and more.
In this article I'll focus on just one aspect of text: Unicode. Whether or not you've heard of Unicode, it affects you. Mac OS X is a Unicode system. Its native strings are Unicode strings. Many of the fonts that come with Mac OS X are Unicode fonts.
But there are problems. Mac OS X's transition to Unicode is far from complete. There are places where Unicode doesn't work, where it isn't implemented properly, where it gets in your way. Perhaps you've encountered some of these, shrugged, and moved on, never suspecting the cause. Well, from now on, perhaps you'll notice the problems a little more and shrug a little less. More important, you'll be prepared for the future, because Unicode is coming. It's heavily present on Mac OS X, and it's only going to become more so. Unicode is the future - your future. And as my favorite movie says, we are all interested in the future, since that is where we shall spend the rest of our lives.
ASCII No Questions
To understand the future, we must start with the past.
In the beginning was writing, the printing press, books, the typewriter, and in particular a special kind of typewriter for sending information across electrical wires - the teletype. Perhaps you've seen one in an old movie, clattering out a news story or a military order. Teletype machines worked by encoding typed letters of the alphabet as electrical impulses and decoding them on the other end.
When computers started to be interactive and remotely operable, teletypes were a natural way to talk to them; and the first universal standard computer "alphabet" emerged, not without some struggle, from how teletypes worked. This was ASCII (pronounced "askey"), the American Standard Code for Information Interchange; and you can still see the teletype influence in the presence of its "control codes," so called because they helped control the teletype at the far end of the line. (For example, hitting Control-G sent a control code which made a bell ring on the remote teletype, to get the operator's attention - the ancestor of today's alert beep.)
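The bell, in fact, still lives at code 7 in every ASCII-descended character set; a tiny sketch in Python (my own illustration, nothing to do with teletypes or Mac OS X specifically) makes the point:

    bell = chr(7)            # ASCII code 7, the BEL control character
    print(ord("\a"))         # -> 7: the "\a" escape is that same character
    print(bell == "\a")      # -> True; printing it asks a terminal to beep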
The United States being the major economic and technological force in computing, the ASCII characters were the capital and small letters of the Roman alphabet, along with some common typewriter punctuation and the control codes. The set originally comprised 128 characters. That number is, of course, a power of 2 - no coincidence, since binary lies at the heart of computers.
When I got an Apple IIc, I was astounded to find ASCII extended by another power of 2, to embrace 256 characters. This made sense mathematically, because 256 is what eight binary bits - a byte, the minimum unit of memory data - can represent. Using the whole byte was less wasteful than leaving the eighth bit idle, but it was far from clear what to do with the extra 128 characters, which were referred to as "high ASCII" to distinguish them from the original 128 "low ASCII" characters. The problem was the computer's monitor - its screen. In those days, screen representation of text was wired into the monitor's hardware, and low ASCII was all it could display.
Flaunt Your Fonts, Watch Your Language
When the Macintosh came along in 1984, everything changed. The Mac's entire screen displayed graphics, and the computer itself, not the monitor hardware, had the job of constructing the letters when text was to be displayed. At the time this was stunning and absolutely revolutionary. A character could be anything whatever, and for the first time, people saw all 256 characters really being used. To access high ASCII, you pressed the Option key. What you saw when you did so was amazing: A bullet! A paragraph symbol! A c-cedilla! Thus arrived the MacRoman character set to which we've all become accustomed.
Since the computer was drawing the character, you also had a choice of fonts - another revolution. After the delirium of playing with the Venice and San Francisco fonts started to wear off, users saw that this had big consequences for the representation of non-Roman languages. After all, no law tied the 256 keycodes to the 256 letters of the MacRoman character set. A different font could give you 256 _more_ letters - as the Symbol font amply demonstrated. This, in fact, is why I switched to a Mac. In short order I was typing Greek, Devanagari (the Sanskrit syllabary), and phonetic symbols. After years of struggling with international typewriters or filling in symbols by hand, I was now my own typesetter, and in seventh heaven.
Trouble in Paradise
Heaven, however, had its limits. Suppose I wanted to print a document. Laser printers were expensive, so I had to print in a Mac lab where the computers didn't necessarily have the same fonts I did, and thus couldn't print my document properly. The same problem arose if I wanted to give a file to a colleague or a publisher who might not have the fonts I was using, and so couldn't view my document properly.
Windows users posed yet another problem. The Windows character set was perversely different from the Mac's. For example, WinLatin1 (often referred to, somewhat inaccurately, as ISO 8859-1) places the upside-down interrogative that opens a Spanish question at code 191; but that character is 192 on Mac (where 191 is the Norwegian slashed-o).
And even among Mac users, "normal" fonts came in many linguistic varieties, because the 256 characters of MacRoman do not suffice for every language that uses a variation of the Roman alphabet. Consider Turkish, for instance. MacRoman includes a Turkish dotless-i, but not a Turkish s-cedilla. So on a Turkish Mac the s-cedilla replaces the American Mac's "fl" ligature. A parallel thing happens on Windows, where (for example) Turkish s-cedilla and the Old English thorn characters occupy the same numeric spot in different language systems.
Tower of Babel
None of this would count as problematic were it not for communications. If your computing is confined to your own office and your own printer and your own documents, you can work just fine. But cross-platform considerations introduce a new twist, and of course the rise of the Internet really brought things to a head. Suddenly people whose base systems differed were sending each other email and reading each other's Web pages. Conventions were established for coping, but these work only to the extent that people and software obey them. If you've ever received email from someone named "=?iso-8859-1?Q?St=E9phane?=," or if you've read a Web page where quotes appeared as a funny-looking capital O, you've experienced some form of the problem.
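That garbled sender name, by the way, is a MIME "encoded word" - a convention for smuggling non-ASCII text through ASCII-only mail headers - and it can be unpacked mechanically. A small sketch in Python (again, just my illustration):

    from email.header import decode_header, make_header

    raw = "=?iso-8859-1?Q?St=E9phane?="
    # decode_header() parses the encoded word; make_header() reassembles it
    print(str(make_header(decode_header(raw))))   # -> Stéphane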
Also, since fonts don't travel across the Internet, characters that depend on a particular font may not be viewable at all. HTML can ask that certain characters should appear in a certain font on your machine when you view my page, but a fat lot of good that will do if you don't have that font.
Finally, there is a major issue I haven't mentioned yet: for some writing systems, 256 characters is nowhere near enough. An obvious example is Chinese, which requires several thousand characters.
Enter Unicode.
The Premise and the Promise
What Unicode proposes is simple enough: increase the number of bytes used to represent each character. For example, if you use two bytes per character, you can have 65,536 characters - enough to represent the Roman alphabet plus various accents and diacritics, plus Greek, Russian, Hebrew, Arabic, Devanagari, the core symbols of various Asian languages, and many others.
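To make that concrete, here is a brief sketch in Python (my own example): one Unicode string mixing several scripts, where every character occupies exactly two bytes under a two-byte encoding such as UTF-16.

    s = "A\u00e9\u03b1\u05d0\u0915"     # Latin A, e-acute, Greek alpha, Hebrew alef, Devanagari ka
    encoded = s.encode("utf-16-be")     # big-endian UTF-16, no byte-order mark
    print(len(s), len(encoded))         # -> 5 10: five characters, two bytes apiece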
What's new here isn't the codification of character codes to represent different languages; the various existing character sets already did that, albeit clumsily. Nor is it the use of a double-byte system; such systems were already in use to represent Asian characters. What's new is the grand unification into a single character set embracing all characters at once. In other words, Unicode would do away with character set variations across systems and fonts. In fact, in theory a single (huge) font could potentially contain all needed characters.
It turns out, actually, that even 65,536 symbols aren't enough, once you start taking into account specialized scholars' requirements for conventional markings and historical characters (about which the folks who set the Unicode standards have often proved not to be as well informed as they like to imagine). Therefore Unicode has recently been extended to a potential 16 further sets of 65,536 characters (called "supplementary planes"); the size of the potential character set thus approximates a million, with each character represented by at most 4 bytes. The first supplementary plane is already being populated with such things as Gothic; musical and mathematical symbols; Mycenaean (Linear B); and Egyptian hieroglyphics. The evolving standard is, not surprisingly, the subject of various political, cultural, technical, and scholarly struggles.
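If you want to see the arithmetic in action, here is one last Python sketch (mine, purely illustrative), using a character from that first supplementary plane:

    clef = "\U0001D11E"                     # MUSICAL SYMBOL G CLEF, beyond the 65,536 limit
    print(hex(ord(clef)))                   # -> 0x1d11e
    print(len(clef.encode("utf-16-be")))    # -> 4 bytes (a "surrogate pair" of two 16-bit units)
    print(len(clef.encode("utf-8")))        # -> 4 bytes in UTF-8 as well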
<http://www.unicode.org/>
<http://www.unicode.org/unicode/standard/principles.html>