by Matt Neuburg
<matt@tidbits.com>
If you're using Mac OS X, a massive revolution is proceeding unnoticed on your computer. No, I don't mean Unix, preemptive multitasking, or any other familiar buzzwords. I'm talking about text.
How can text be revolutionary? Text is not sexy. We take text for granted, typing it, reading it, editing it, storing it. Text is one of the main reasons most people bought computers in the first place. It's a means, a medium; it's not an end, not something explicit. The keyboard lies under our hands; strike a key and the corresponding letter appears. What could be simpler?
But the more you know about text and how it works on a computer, the more amazing it is that you can do any typing at all. There are issues of what keyboard you're using, how the physical keys map to virtual keycodes, how the virtual keycodes are represented as characters, how to draw the characters on the screen, and how to store information about them in files. There are problems of languages, fonts, uppercase and lowercase, diacritics, sort order, and more.
In this article I'll focus on just one aspect of text: Unicode. Whether or not you've heard of Unicode, it affects you. Mac OS X is a Unicode system. Its native strings are Unicode strings. Many of the fonts that come with Mac OS X are Unicode fonts.
But there are problems. Mac OS X's transition to Unicode is far from complete. There are places where Unicode doesn't work, where it isn't implemented properly, where it gets in your way. Perhaps you've encountered some of these, shrugged, and moved on, never suspecting the cause. Well, from now on, perhaps you'll notice the problems a little more and shrug a little less. More important, you'll be prepared for the future, because Unicode is coming. It's heavily present on Mac OS X, and it's only going to become more so. Unicode is the future - your future. And as my favorite movie says, we are all interested in the future, since that is where we shall spend the rest of our lives.
ASCII No Questions
To understand the future, we must start with the past.
In the beginning was writing, the printing press, books, the typewriter, and in particular a special kind of typewriter for sending information across electrical wires - the teletype. Perhaps you've seen one in an old movie, clattering out a news story or a military order. Teletype machines worked by encoding typed letters of the alphabet as electrical impulses and decoding them on the other end.
When computers started to be interactive and remotely operable, teletypes were a natural way to talk to them; and the first universal standard computer "alphabet" emerged, not without some struggle, from how teletypes worked. This was ASCII (pronounced "askey"), the American Standard Code for Information Interchange; and you can still see the teletype influence in the presence of its "control codes," so called because they helped control the teletype at the far end of the line. (For example, hitting Control-G sent a control code which made a bell ring on the remote teletype, to get the operator's attention - the ancestor of today's alert beep.)
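The bell, in fact, still lives at code 7 in every ASCII-descended character set; a tiny sketch in Python (my own illustration, nothing to do with teletypes or Mac OS X specifically) makes the point:

    bell = chr(7)            # ASCII code 7, the BEL control character
    print(ord("\a"))         # -> 7: the "\a" escape is that same character
    print(bell == "\a")      # -> True; printing it asks a terminal to beep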
The United States being the major economic and technological force in computing, the ASCII characters were the capital and small letters of the Roman alphabet, along with some common typewriter punctuation and the control codes. The set originally comprised 128 characters. That number is, of course, a power of 2 - no coincidence, since binary lies at the heart of computers.
When I got an Apple IIc, I was astounded to find ASCII extended by another power of 2, to embrace 256 characters. This made sense mathematically, because 256 is what eight binary bits - a byte, the minimum unit of memory data - can represent. Using the whole byte was less wasteful than leaving the eighth bit idle, but it was far from clear what to do with the extra 128 characters, which were referred to as "high ASCII" to distinguish them from the original 128 "low ASCII" characters. The problem was the computer's monitor - its screen. In those days, screen representation of text was wired into the monitor's hardware, and low ASCII was all it could display.
Flaunt Your Fonts, Watch Your Language
When the Macintosh came along in 1984, everything changed. The Mac's entire screen displayed graphics, and the computer itself, not the monitor hardware, had the job of constructing the letters when text was to be displayed. At the time this was stunning and absolutely revolutionary. A character could be anything whatever, and for the first time, people saw all 256 characters really being used. To access high ASCII, you pressed the Option key. What you saw when you did so was amazing: A bullet! A paragraph symbol! A c-cedilla! Thus arrived the MacRoman character set to which we've all become accustomed.
Since the computer was drawing the character, you also had a choice of fonts - another revolution. After the delirium of playing with the Venice and San Francisco fonts started to wear off, users saw that this had big consequences for the representation of non-Roman languages. After all, no law tied the 256 keycodes to the 256 letters of the MacRoman character set. A different font could give you 256 _more_ letters - as the Symbol font amply demonstrated. This, in fact, is why I switched to a Mac. In short order I was typing Greek, Devanagari (the Sanskrit syllabary), and phonetic symbols. After years of struggling with international typewriters or filling in symbols by hand, I was now my own typesetter, and in seventh heaven.
Trouble in Paradise
Heaven, however, had its limits. Suppose I wanted to print a document. Laser printers were expensive, so I had to print in a Mac lab where the computers didn't necessarily have the same fonts I did, and thus couldn't print my document properly. The same problem arose if I wanted to give a file to a colleague or a publisher who might not have the fonts I was using, and so couldn't view my document properly.
Windows users posed yet another problem. The Windows character set was perversely different from the Mac's. For example, WinLatin1 (often referred to, somewhat inaccurately, as ISO 8859-1) places the upside-down interrogative that opens a Spanish question at code 191; but that character is 192 on Mac (where 191 is the Norwegian slashed-o).
And even among Mac users, "normal" fonts came in many linguistic varieties, because the 256 characters of MacRoman do not suffice for every language that uses a variation of the Roman alphabet. Consider Turkish, for instance. MacRoman includes a Turkish dotless-i, but not a Turkish s-cedilla. So on a Turkish Mac the s-cedilla replaces the American Mac's "fl" ligature. A parallel thing happens on Windows, where (for example) Turkish s-cedilla and the Old English thorn characters occupy the same numeric spot in different language systems.
Tower of Babel
None of this would count as problematic were it not for communications. If your computing is confined to your own office and your own printer and your own documents, you can work just fine. But cross-platform considerations introduce a new twist, and of course the rise of the Internet really brought things to a head. Suddenly people whose base systems differed were sending each other email and reading each other's Web pages. Conventions were established for coping, but these work only to the extent that people and software obey them. If you've ever received email from someone named "=?iso-8859-1?Q?St=E9phane?=," or if you've read a Web page where quotes appeared as a funny-looking capital O, you've experienced some form of the problem.
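That garbled sender name, by the way, is a MIME "encoded word" - a convention for smuggling non-ASCII text through ASCII-only mail headers - and it can be unpacked mechanically. A small sketch in Python (again, just my illustration):

    from email.header import decode_header, make_header

    raw = "=?iso-8859-1?Q?St=E9phane?="
    # decode_header() parses the encoded word; make_header() reassembles it
    print(str(make_header(decode_header(raw))))   # -> Stéphane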
Also, since fonts don't travel across the Internet, characters that depend on a particular font may not be viewable at all. HTML can ask that certain characters should appear in a certain font on your machine when you view my page, but a fat lot of good that will do if you don't have that font.
Finally, there is a major issue I haven't mentioned yet: for some writing systems, 256 characters is nowhere near enough. An obvious example is Chinese, which requires several thousand characters.
Enter Unicode.
The Premise and the Promise
What Unicode proposes is simple enough: increase the number of bytes used to represent each character. For example, if you use two bytes per character, you can have 65,536 characters - enough to represent the Roman alphabet plus various accents and diacritics, plus Greek, Russian, Hebrew, Arabic, Devanagari, the core symbols of various Asian languages, and many others.
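To make that concrete, here is a brief sketch in Python (my own example): one Unicode string mixing several scripts, where every character occupies exactly two bytes under a two-byte encoding such as UTF-16.

    s = "A\u00e9\u03b1\u05d0\u0915"     # Latin A, e-acute, Greek alpha, Hebrew alef, Devanagari ka
    encoded = s.encode("utf-16-be")     # big-endian UTF-16, no byte-order mark
    print(len(s), len(encoded))         # -> 5 10: five characters, two bytes apiece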
What's new here isn't the codification of character codes to represent different languages; the various existing character sets already did that, albeit clumsily. Nor is it the use of a double-byte system; such systems were already in use to represent Asian characters. What's new is the grand unification into a single character set embracing all characters at once. In other words, Unicode would do away with character set variations across systems and fonts. In fact, in theory a single (huge) font could potentially contain all needed characters.
It turns out, actually, that even 65,536 symbols aren't enough, once you start taking into account specialized scholars' requirements for conventional markings and historical characters (about which the folks who set the Unicode standards have often proved not to be as well informed as they like to imagine). Therefore Unicode has recently been extended to a potential 16 further sets of 65,536 characters (called "supplementary planes"); the size of the potential character set thus approximates a million, with each character represented by at most 4 bytes. The first supplementary plane is already being populated with such things as Gothic; musical and mathematical symbols; Mycenaean (Linear B); and Egyptian hieroglyphics. The evolving standard is, not surprisingly, the subject of various political, cultural, technical, and scholarly struggles.
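If you want to see the arithmetic in action, here is one last Python sketch (mine, purely illustrative), using a character from that first supplementary plane:

    clef = "\U0001D11E"                     # MUSICAL SYMBOL G CLEF, beyond the 65,536 limit
    print(hex(ord(clef)))                   # -> 0x1d11e
    print(len(clef.encode("utf-16-be")))    # -> 4 bytes (a "surrogate pair" of two 16-bit units)
    print(len(clef.encode("utf-8")))        # -> 4 bytes in UTF-8 as well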
<http://www.unicode.org/>
<http://www.unicode.org/unicode/standard/principles.html>