For want of a typeface – Increment: Internationalization

Imagine being unable to write on paper in your native tongue. How frustrating would that be—and what alternatives would you have? You may, perhaps in reluctance or anger, use the letters or shapes of another language.

About 4,000 of over 7,000 living languages (languages with at least one native speaker) have a written form, even if that form was developed long after the language developed a distinct identity. With a global literacy rate of 86 percent, most people today would find it absurd to be unable to jot down a note, pen a report, or even write a multi-volume set of books.

But when it comes to digital interaction, not all languages are re-created equal. Some letters, logograms, flowing shapes, or other components used by a single language or shared by several—collectively known as a script—lack representation in typefaces appropriate for digital display, especially for mobile devices and standalone hardware, like Blu-ray players and HDTVs. This leaves some people largely unable to interact in the language and form they know using digital interfaces, whether for product navigation, reading, or writing.

No standards body or company made an intentional decision to exclude particular languages. The focus on font-making, however, veers towards what’s commercially necessary—what’s used by millions or billions—whether for an operating system like those from Microsoft or Apple, or for an independent digital typeface foundry. This results in a linguistic hegemony that’s picked winners and losers among languages and scripts.

Speakers of some languages and users of some scripts have experienced brutal repression, such as with the Kurds in Syria and Turkey, but Kurdish fonts remain available and in some abundance. The digital divide of font haves and have-nots is instead an outgrowth of a history that has left some languages with so few speakers that people with the right expertise haven’t designed typefaces for those scripts.

Some of it also boils down to simplicity. The Roman alphabet used in the Americas, Europe, and parts of Asia has at most 26 letterforms, augmented by fiddly bits like the cedilla (ç), the l-slash in Polish and other Slavic languages (ł), and diacritical marks like ̈ and ́. Added to that, the Roman alphabet’s broad use and later adoption for some tongues, such as Turkish, has led to 2.3 billion people currently relying on the script.

But in an age of digital plenty, shouldn’t it be possible for every script for every language to have its place at the digital table?

The technical answer to that question is, seemingly, yes—we have Unicode! Unicode is a master list of every symbol necessary for every script, academic discipline, and beyond (see: emoji), assembled and continuously expanded by the Unicode Consortium. Unicode’s inventory is now at 140,000: a mix of characters in scripts, dingbats, arrows, and math symbols, and it has room for over a million more. Across the decades, Unicode has incorporated so many of the pieces needed for popular, contemporary languages that it’s now working on what remains: adding symbols for scripts and languages used by few living people (so-called “minority scripts”); missing characters from languages like Chinese, Japanese, and Korean; and making room for the more obscure dead languages—not just, say, Latin or Ancient Greek—used by scholars and historians. (This can include, for instance, Ogham, which has some characters seen only on eroded tombstones.)

However, Unicode is only part of the solution. Unicode defines a numbered location for what a character represents and a kind of schematic of what it should look like. “Devanagari Vowel Sign Candra Long E” (known in Unicode as U+0955 or ॕ), for instance, refers to a particular character in that script, but doesn’t explain at all how to draw it. Instead, it’s up to typeface designers and digital font foundries to create versions of those characters that can be entered on keyboards and touchscreens and displayed on devices worldwide.
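To make the split between a numbered location and a drawn shape concrete, here is a minimal sketch using Python's standard `unicodedata` module. The choice of Python and the variable names are this example's assumptions, not anything Unicode prescribes. It looks up what the standard records for U+0955: a number, a name, and a category, but nothing about the curves a font has to draw.

```python
import sys
import unicodedata

# The Devanagari vowel sign mentioned above, written by its code point.
ch = "\u0955"

code_point = f"U+{ord(ch):04X}"
name = unicodedata.name(ch)          # the character's official Unicode name
category = unicodedata.category(ch)  # general category, e.g. "Mn" for a nonspacing mark

print(code_point, name, category)

# The code space runs from U+0000 to U+10FFFF. That range is where the
# "room for over a million more" comes from.
total_code_points = sys.maxunicode + 1
print(total_code_points)  # 1114112
```

Whether that character actually appears on your screen depends entirely on whether an installed font supplies a glyph for it.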

Some scripts still have few or no faces designed for screen use, only for print. “Hobbyists or linguists or people who are not versed in this black art of type design have made efforts to represent languages,” said Steve Matteson, Monotype’s creative type director and the company’s lead on Google’s Noto project. Not to paint type designers as keepers of a secret art—there are many successful typefaces from people who have taught themselves. But without many years of learning the intricacies, few can readily produce faces that meet technical requirements and look correct to users.

Typefaces for given scripts may also have been designed by non-native speakers or may represent only a particular cultural interpretation of the language. With Arabic, for instance, Matteson said some speakers who reviewed drafts of a script Monotype developed for Google supported a “progressive” look to the typeface, while others had a clear notion of traditional form that needed to be retained. A type designer’s brief includes finding a way to navigate how a script appears when it has few or problematic digital renditions.

And while the Unicode Consortium can seek to ensure all scripts are represented in its inventory, its job isn’t to make sure fonts exist to represent all of those elements. Instead, Google has taken on this modern debabelization.

Google’s Noto project, started seven years ago, already represents tens of thousands of Unicode characters with a single consistent typeface design. Google ultimately plans to add everything in Unicode while giving the fonts away and taking suggestions for improvement. Of course, it has a commercial motivation—Google wants its phones, browsers, operating system, and other products to reach every crevice of the globe. Yet no one else has volunteered for this expensive, long-term task, which has led Google and its font foundry contractors to cut a swath in the linguistic tall grass for others to follow.

What it takes to show a script

There’s nothing easy about getting letters, symbols, logograms, and other characters—collectively known as “glyphs” in the type world—to display correctly on a screen. Or, more accurately, across many kinds of screens, running many different versions of dozens of operating systems, and in hundreds of thousands of apps.

You may know the story of ASCII, an early 7-bit method of representing Roman letters, numbers, some punctuation, and a few other characters within 128 available positions that also had to include control codes used for teletypewriters. (Control-G may still ring a bell somewhere.) Work on ASCII started in 1960, and it became almost universally supported. Grizzled geeks like me may even be able to recite a hexadecimal code for some characters. But ASCII wasn’t enough: It didn’t even adequately encompass most European languages, much less the rest of the world.
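The arithmetic behind ASCII's limits is easy to check. The following is a rough sketch in Python (the names `ASCII_POSITIONS` and `is_ascii` are made up for illustration): seven bits yield 2^7 positions, a chunk of which go to control codes, and anything with an accent falls outside the range.

```python
# Seven bits give 2**7 distinct values, numbered 0 through 127.
ASCII_POSITIONS = 2 ** 7

# Codes 0-31 and 127 are control codes; the rest are printable (including space).
control_codes = [c for c in range(ASCII_POSITIONS) if c < 32 or c == 127]
printable = [chr(c) for c in range(ASCII_POSITIONS) if 32 <= c < 127]

# Control-G: holding Control keeps only the low five bits of the letter's code,
# which turns "G" (0x47) into 0x07, the teletype bell.
BEL = ord("G") & 0x1F


def is_ascii(text: str) -> bool:
    """Return True if every character fits within ASCII's 7 bits."""
    return all(ord(ch) < ASCII_POSITIONS for ch in text)


print(ASCII_POSITIONS, len(control_codes), len(printable))  # 128 33 95
print(hex(BEL), hex(ord("A")))                              # 0x7 0x41
print(is_ascii("facade"), is_ascii("fa\u00e7ade"))          # True False
```

The last line is the whole problem in miniature: the cedilla in "façade" has no seat among those 128 positions.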

Companies, countries, and committees of engineers pushed forward with competing, often incompatible, standards for encoding characters over decades. This ultimately led to the formation of the Unicode Consortium in 1991. Its original board included key members of every major operating system maker at the time.

Unicode’s goal is to represent every script in every written language—along with mathematical symbols, dingbats, and other material—with a unique number called a “code point.” Every code point is universal, meaning the same thing across all operating systems and all purposes.

While Bill Gates threw his support behind Unicode early on—Unicode was supposed to be fully integrated into the initial release of Windows 95—it took a long time for all layers in major operating systems, as well as libraries and software packages incorporated from elsewhere, to grok Unicode from top to bottom.

Even as operating systems shifted towards Unicode compatibility in the late 1990s, the web held it back. Both early versions of HTML and early browsers made it difficult to encode text correctly in the most popular Unicode format, UTF-8. Many years’ worth of sites were then built around either ASCII or a more extensive Western European character encoding, and today continue to act as a powerful inertial force.
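A small Python sketch shows why mixed encodings caused such grief. It assumes Latin-1 (ISO 8859-1) as a stand-in for the "Western European" encoding, since no specific one is named here, and the variable names are illustrative only.

```python
original = "fa\u00e7ade"  # "façade", with a precomposed c-cedilla

# UTF-8 bytes read by software that assumes Latin-1 turn into gibberish.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # faÃ§ade

# The reverse direction fails outright: Latin-1 bytes are not valid UTF-8 here.
latin1_bytes = original.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    legacy_reads_as_utf8 = True
except UnicodeDecodeError:
    legacy_reads_as_utf8 = False

# And the legacy encoding has no slot at all for the Polish l-slash.
try:
    "\u0142".encode("latin-1")
    covers_l_slash = True
except UnicodeEncodeError:
    covers_l_slash = False

print(legacy_reads_as_utf8, covers_l_slash)  # False False
```

A site that stored years of text one way and began serving it another way would hit exactly these failures, which helps explain the inertia.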

Switching to UTF-8, for instance, could involve re-encoding entire databases, tearing apart code that needed to count characters, and finding or developing libraries that know Unicode inside and out. (This writer spent years trying to get an in-house mailing list package he built to handle Unicode in the subject line.)
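Here is a hedged illustration in Python of why counting characters gets slippery. It borrows the Lushootseed name taqwšəblu, which appears later in this piece, as sample text. The variable names are invented for the example, and the byte-based truncation stands in for what older code might have done, not for any particular package.

```python
import unicodedata

word = "taqw\u0161\u0259blu"  # "taqwšəblu", with a precomposed s-caron and a schwa

code_points = len(word)                 # what Python's len() counts
utf8_len = len(word.encode("utf-8"))    # what a byte-oriented database column sees

# The same visible text can also be stored in decomposed form (NFD),
# where the caron becomes a separate combining code point.
decomposed = unicodedata.normalize("NFD", word)
decomposed_points = len(decomposed)

print(code_points, utf8_len, decomposed_points)  # 9 11 10

# Truncating by bytes can slice a character in half.
chopped = word.encode("utf-8")[:5]
try:
    chopped.decode("utf-8")
    clean_cut = True
except UnicodeDecodeError:
    clean_cut = False

# One cautious option: drop the dangling partial character.
safe_prefix = chopped.decode("utf-8", errors="ignore")
print(clean_cut, safe_prefix)  # False taqw
```

Three different answers to "how long is this word?" is the kind of thing that forced code to be torn apart.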

According to Google’s measurements of crawled pages as of 2001, about 60 percent of pages it retrieved were in ASCII and 25 percent in Western European formats. The rest was a mix of Korean, Japanese, Cyrillic, Chinese, and other encodings. By 2010, the two leading encodings had dropped to under 20 percent each, and Unicode was nearly 50 percent—rising to over 60 percent by 2012. By late 2018, W3Techs, which tracks web and browser features, found over 90 percent of websites used UTF-8, and the number continues to rise.

The Tulalip Tribes, a Washington State-based association of Snohomish, Snoqualmie, Skykomish, and other allied American Indian communities, have seen the advantage of this improvement. Their language, Lushootseed, first developed a written form in the 1970s when native speakers—Vi “taqwšəblu” Hilbert in particular—collaborated with a linguist, Thom Hess, to record it. Work like this is especially critical because of the suppression of Native American languages in mandatory boarding schools run by the U.S. and Canadian governments between the 1870s and as late as the 1960s. Students could be punished, even beaten, for speaking their mother tongue.

Hess and Hilbert used the International Phonetic Alphabet (IPA), which was available on specialized typewriters, to record Lushootseed. Yet in the digital age, IPA didn’t initially translate well. It required the installation of special software and fonts until Unicode incorporated it and itself became widespread. Even then, Unicode-compliant IPA typefaces weren’t designed for many languages that employed it. The tribes commissioned a typeface in 2008 that would reflect their needs. This Unicode-compatible typeface coasted into computers and then mobile phones in the years that followed, as robust Unicode support among operating systems rose. Lushootseed’s typeface now requires no special effort to view and little effort to tap or type in.

The Tulalip Tribes’ language program head, Michele Balagot, said that Lushootseed has become more popular among tribal students, who learn it as part of their cultural education, because they can post in the script on Chromebooks, on Facebook, and in text messages. It’s also brought older speakers to computers, because they can type and text in a language they know. The tribes’ typeface provides a glimpse of what it looks like when expansive access to fonts appears where none existed before.

Lushootseed has benefitted from Unicode’s ever more robust support across operating systems and applications. But Unicode’s enumeration of characters is in essence an abstract notion more than an instantiated reality. Every Unicode character you want to type or read on a screen has to be part of a font, a digital collection of unique characters mapped to Unicode points on its chart.
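As a toy model only, the relationship between code points and fonts can be sketched in Python as a lookup table with a fallback chain. Everything here is hypothetical: the two miniature "fonts," the glyph names, and the `glyphs_for` helper are invented for illustration, though real font files do carry a comparable character-to-glyph table, and `.notdef` is the conventional name for the placeholder glyph shown when nothing matches.

```python
from typing import Dict, List

Font = Dict[int, str]  # code point -> glyph name (a stand-in for real outlines)

# Hypothetical fonts: one covers basic Latin lowercase, one adds two IPA letters.
latin_font: Font = {ord(c): f"glyph_{c}" for c in "abcdefghijklmnopqrstuvwxyz"}
ipa_font: Font = {0x0161: "scaron", 0x0259: "schwa"}


def glyphs_for(text: str, fonts: List[Font]) -> List[str]:
    """Pick a glyph for each character from the first font that has one."""
    chosen = []
    for ch in text:
        for font in fonts:
            if ord(ch) in font:
                chosen.append(font[ord(ch)])
                break
        else:
            # No installed font covers this code point: show the placeholder box.
            chosen.append(".notdef")
    return chosen


word = "taqw\u0161\u0259blu"  # "taqwšəblu"

latin_only = glyphs_for(word, [latin_font])
with_ipa = glyphs_for(word, [latin_font, ipa_font])

print(latin_only.count(".notdef"))  # 2
print(with_ipa.count(".notdef"))    # 0
```

The code points never changed between the two calls. Only the available fonts did, which is the gap between Unicode's chart and what a reader sees.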

Most computers and mobile devices come with fonts installed that have large character sets that contain glyphs covering the most popular (and some less popular) scripts. But there’s still a gap between what Unicode describes, what speakers of lesser-used languages need, and what’s available.

It’s not a trivial matter to make digital typefaces for every Unicode character. To understand why, we have to rewind to about 1450.

Gutenberg had it easy

Johannes Gutenberg didn’t invent movable type, the concept of individual letters, numbers, punctuation, and spaces that could be set one at a time, rearranged, printed from, and reused. This kind of type far predated his first books, which were printed around 1450 in Mainz in what is present-day Germany. Whether he knew it or not, books printed from ceramic, metal, and wood-block type had already appeared in China, Korea, and Japan starting hundreds of years, if not millennia, earlier.

His invention, a hand mold, could take a “matrix” (an impression created from a punch, which was a hardened steel stick emblazoned on one end with a carving) and use it to produce hundreds to thousands of duplicates. The matrix allowed for the uniform creation of type, in his case simulating black letter “gothic” scribal writing.

Yet even with that invention, Gutenberg may have gotten nowhere fast if he hadn’t had the advantage of the Roman (or Latin) script. At the time, Latin and German had fewer than 26 characters in regular use, plus diacritics. Chinese, Korean, Vietnamese, and Japanese required at least thousands of logograms for literary and bureaucratic writing, which was still only a fraction of all characters in those languages. (Unicode currently includes nearly 90,000 entries for a unified set of logograms for those four languages.)

That meant that movable type by pioneering publishers in those languages was highly limited, with blocks more likely carved or engraved as needed than created in the punch mold-casting sequence. Most books printed as multiples relied on full carved woodblocks—it was simply more efficient.

Almost six centuries after Gutenberg, that advantage for Roman and similar script families, like Cyrillic and Greek, persists. For handwriting, people who speak languages with enormous character counts must learn to write them, but the hand is infinitely malleable, and a writer can produce any element of the language at will.

Not so for characters that need to be reproduced, whether on a printing press or a screen. No matter whether you’re creating a typeface to be cast in lead or displayed on-screen, each discrete character must be designed individually by someone who combines technical know-how with a knowledge of legibility and aesthetics.

A designed character must also fit both within the look and feel of other related parts of a font to create a typeface. And it has to match the fuzzy, archetypal sense that native speakers of a language have about the script they use, which lets them recognize glyphs across a wide variety of styles and variations.

Type designer Matthew Carter—of Georgia, Bell Centennial, and other iconic faces across more than six decades—told me in 2010, “I can’t wake up one morning and say, ‘Screw the letter B.’” He noted that “what we work with had its form essentially frozen way before there was even typography.”

And that’s just with the Roman alphabet. Kamal Mansour of Monotype, who represents the company at the Unicode Consortium, said that some Southeast Asian scripts are used by so few people that their communities don’t have the resources to develop their own fonts. And, he said, a given language might be written two or three different ways. Monotype’s Matteson noted that those forms often result from native speakers writing with different writing tools or techniques. Two people sitting next to each other might write a script differently, or its appearance could evolve separately in different parts of the world in which it’s used.

That’s true of some scripts with larger populations, too. For its Noto project, Google gathered employees and outside reviewers across several languages, who spoke and wrote in the same mother tongue. The Noto team at times had to become mediators, facilitating discussions among reviewers and designers to find an ideal version of an alphabetic item.

Noto bene

Noto set what Google thought was an ambitious but relatively simple goal: to produce a unified typeface that would aid the firm across all its purposes while providing minority languages with high-quality and consistent digital fonts. (The fonts in the project are free and open-source licensed, and can be modified and distributed without requiring permission.)

At the first production release of Noto in 2016, which covered all characters to Unicode 6.1, a Noto product manager told me that, as engineers, he and others didn’t fully understand the complexity of what they had signed up for. But they soon learned. And, given the time, work, and people power already expended, the project has easily cost many millions of dollars so far.

Google hired Monotype as the project lead, and hundreds of designers across Monotype, Adobe, and other foundries have worked on the font. These foundries already had experts and consultants for many scripts, but typically only covered the most popular few dozen. That led designers and engineers to turn partly into cultural anthropologists.

The main Tibetan script, for instance, is written by an estimated 5 million people and is about 28th on the list of scripts by active use. Yet there was neither an established, canonical digital form of the language nor even a strongly shared idea about what it should look like. An ex-monk in California whom the team consulted found early drafts exciting, but his former brethren in Tibet found their modernity less acceptable. A Google project manager said they eventually found compromises.

In a more complicated case, Mansour said that written Armenian had exploded into different experimental type styles since the end of the Soviet Union. As a result, it seemed impossible to find a single style that would please multiple opinionated camps. But Monotype found a young designer who had studied Armenian typography and spoken to experts, and who ultimately created a Noto rendition rooted in older traditions with a modern look. “With this, we presented a face that’s more neutral,” Mansour said. It “doesn’t necessarily reflect current trends but doesn’t contradict them, either.”

Unicode 11 now incorporates elements and symbols needed for 146 scripts. Although that seems like a small number relative to the total number of written languages, the remaining scripts reflect ever smaller groups of speakers or historians, sometimes on the order of thousands.

Indeed, some of Noto’s work has nothing to do with fonts, but with technology haves and have-nots. Older phones and computers compatible with earlier releases of Unicode can’t load and use Noto for more recently added languages: The fonts exist, but not the support. “In many markets, people can’t afford the latest phones. There’s an artificial waiting period introduced just because nobody goes back to refit,” said Mansour.

For those in technology and type design, giving voice to languages on-screen that have been silent or difficult to express marks a distinct difference from the usual work of compatibility and legibility.

“Never in my wildest dreams did I think I’d have to learn something like 100 scripts in five or six years,” Mansour reflected. And encountering the grateful reactions of those finally seeing a representation of their language on-screen has been powerful to him.

The impact of a complete and well-designed set of glyphs for one’s language is a hard thing to measure, especially for small populations. But Unicode’s continued efforts towards minority languages and Google’s expensive inclusive effort will ensure one thing: People who want to write in their own language will have at least one font at their disposal, and a well-designed one at that.
