The wonderful world of emoji ✨ – Increment: Internationalization

It started in Japan over two decades ago as a way of using leftover space in a character database. Now, it’s exploded into a worldwide phenomenon with such global saturation that multiple styles of poop slippers are available at almost any local market. It’s emoji. And its abundance and popularity are making Unicode more relevant than it’s ever been.

Languages made universal (and multiplanar) 🗺️🌐🌌

Because computers, we have Unicode: a universal encoding standard that’s designed to have enough space to store every character of every written human language and then some. Unicode is a superset character encoding standard that incorporates and is backwards compatible with the biggest early encoding sets: ASCII (circa early 1960s), which allows 128 characters and covers most characters you need as an English-speaking American; and “extended ASCII” (a generalization, but circa late 1970s), which allows 256 characters, extending its usability to Western European languages.

However, extended ASCII still lacked the characters needed to write languages that don’t use the Latin alphabet. Alternate encoding systems sprang up to address this, but as computers started to connect to each other around the globe with the dawn of the internet, disparate encoding systems created more issues than they solved. Computers needed to communicate with each other using a standard encoding that could encompass all the characters any written language would require. And so, in 1991, the Unicode Consortium was founded and Unicode created.

Characters in Unicode are usually denoted by their code point: the prefix U+ followed by four to six hexadecimal digits (0–9, then A–F). Here, we’ll be naming specific characters in the following form:

A U+0041 LATIN CAPITAL LETTER A

This represents: the character itself, its Unicode code point, and its formal Unicode character name (or just name). The Common Locale Data Repository (CLDR) separately provides shorter, localizable names for emoji.
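As a minimal sketch of this notation, the snippet below formats a character in the same three-part form. The helper name `describe` is hypothetical, and the names come from whatever version of the Unicode Character Database ships with your Python install, so very recent characters may be reported as unnamed.

```python
import unicodedata


def describe(char: str) -> str:
    """Format a single character as '<char> U+XXXX NAME'."""
    code_point = ord(char)
    # Fall back to a placeholder when this Python build has no name on record.
    name = unicodedata.name(char, "<unnamed>")
    # :04X pads to at least four upper-case hex digits; code points above
    # U+FFFF simply use five or six digits.
    return f"{char} U+{code_point:04X} {name}"


print(describe("A"))
print(describe("\u2728"))
```

Run as written, the first line printed is `A U+0041 LATIN CAPITAL LETTER A`, matching the example above.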

Unicode allows for over a million different code points (1,114,112, to be exact) spread across 17 planes of 65,536 code points each, with every plane divided into blocks. Planes are the major sections (Basic, Supplementary), while blocks allow for semi-logical separation of different types of scripts, keeping characters of the same scripts together where possible. (Because characters get added over time, sometimes there’s no space left in the block, and they have to be added to a different block.) In the end, a code point has a unique identifier that describes it, no matter where on a given plane it resides. LATIN CAPITAL LETTER A will always be U+0041.
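Because every plane holds exactly 0x10000 code points, the plane a character lives on can be read straight off its code point. The sketch below illustrates that arithmetic; `plane_of` and `PLANE_NAMES` are hypothetical names, and the table lists only the first three planes.

```python
PLANE_SIZE = 0x10000          # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF     # highest code point Unicode allows

# Only the first few planes are listed here, for illustration.
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
}


def plane_of(char: str) -> int:
    """Return the plane number (0-16) of a single character."""
    # Integer division by the plane size is the same as ord(char) >> 16.
    return ord(char) // PLANE_SIZE


total_code_points = MAX_CODE_POINT + 1
total_planes = total_code_points // PLANE_SIZE

for ch in ("A", "\U0001F99C"):
    plane = plane_of(ch)
    print(f"U+{ord(ch):04X} is on plane {plane}: {PLANE_NAMES.get(plane, 'another plane')}")
```

Here `total_code_points` works out to 1,114,112 and `total_planes` to 17, which is where "over a million" comes from.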

ASCII and ANSI, along with most modern languages, exist in the Basic Multilingual Plane: U+0000 through U+FFFF. Every code point in this plane fits in 16 bits, so each can be stored as a single two-byte unit in UTF-16; in UTF-8, only the 128 ASCII characters fit in a single byte, and the rest of the plane takes two or three. Anything beyond this plane needs four bytes in either encoding, and these are the “multibyte characters” referred to in the rest of this article. Most emoji live in the Supplementary Multilingual Plane, from U+10000 through U+1FFFF. Many other scripts exist in blocks on this plane, including hieroglyphics (𓂀 U+13080 EGYPTIAN HIEROGLYPH D010) and cuneiform (𒀀 U+12000 CUNEIFORM SIGN A). There are other planes, but they are sparsely populated, for now.
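How many bytes a character occupies depends on the encoding as well as the plane. The following sketch measures a few sample characters in UTF-8, UTF-16, and UTF-32; the helper name `encoded_sizes` is hypothetical, and the little-endian codec variants are used only so that no byte-order mark is counted.

```python
def encoded_sizes(char: str) -> dict:
    """Return the number of bytes one character needs in each encoding."""
    return {
        "utf-8": len(char.encode("utf-8")),
        # The "-le" variants skip the byte-order mark Python would otherwise add.
        "utf-16": len(char.encode("utf-16-le")),
        "utf-32": len(char.encode("utf-32-le")),
    }


samples = {
    "LATIN CAPITAL LETTER A": "A",                     # ASCII
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",       # BMP, beyond ASCII
    "SPARKLES": "\u2728",                              # BMP emoji
    "PARROT": "\U0001F99C",                            # supplementary plane
}

for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {encoded_sizes(ch)}")
```

Only the supplementary-plane character needs four bytes in both UTF-8 and UTF-16, which is the case that systems built for the Basic Multilingual Plane alone tend to mishandle.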

Security = 😀

A million characters. A globally recognized standard. Space to encode the alphabet of every written human language.

It sounds magnificent, but most people don’t get to see its gravity. The alphabets that appear earlier in Unicode—the ones that are readily available to most everyone—cover the most popular languages. Alphabets deeper in the specification are used by fewer people. Why would you update the version you use, or the way you use it, when you personally have everything you think you need?

Emoji are inarguably Unicode’s most popular set of characters. For many people, emoji are the sole reason they care that their computers and social media apps and websites support the latest version of Unicode and, in turn, multibyte characters such as emoji. Perhaps this shouldn’t be the case—surely we should care about access and users beyond ourselves—but this is how a non-zero amount of education about the issue happens. If it weren’t for the popularity of emoji, many people wouldn’t care.

With emoji as the driving factor, consumers are much more likely to apply software updates. When “emoji day” happens, now a yearly event during which Apple adds the annual set of 100+ emoji to iOS and macOS, iPhone owners clamor to run their updates so they can send the 🦜 U+1F99C PARROT to their friends. They’re also, perhaps unwittingly, patching their phones with the latest security updates. It’s a remarkable phenomenon. While, for example, an app advertising a new update as having “various bugs and fixes” may not be immediately updated, the new system update with over 100 new emoji has users actively, even eagerly, installing it. Emoji, like the toy at the bottom of a cereal box, are an impressively effective feature for marketing major version updates.

Emoji have effected real change in platforms. WordPress, which runs at least 20 percent of the world’s websites, made a major update to its systems in 2015 under the guise of “enabling emoji support.” What it actually did was patch a critical security vulnerability that allowed cross-site scripting attacks in some multibyte character situations. In essence (and this is only a tiny exaggeration): a quarter of the internet was saved from hacking by adding emoji support.

How an emoji is made

Emoji are a completely valid way to drive curiosity about Unicode. Since starting to research and speak and write about how emoji work, I’ve become a Unicode nerd. The mechanics around not just the technological processes of how new emoji are made but also the human processes are truly fascinating. Did you know that you can petition for your own emoji—a full-fledged addition to the universal standard? It’s a long process, but it is open to the public. The Unicode Consortium’s proposal guidelines detail everything—compatibility, distinctness—that will help a proposed emoji’s case, as well as a number of things that won’t—being overly specific, open-ended, or a logo or representation of a brand.

a HyPhEn iS InVaLiD iN SuRnAmEs

As soon as support exists for multibyte characters, every other code point in Unicode has support. (This assumes, of course, that your system has a font that can recognize the character and display it, which isnʼt the case for all systems. The cuneiform examples from earlier in this article, while valid Unicode characters, arenʼt always supported.) Supporting multibyte characters is an imperative feature—not just to indulge our love of emoji, but because it impacts many people on deep and intrinsic levels.

Riddle me this: How would you feel if you werenʼt able to type your own name? If that name—your personal identifier that makes you “you”—confounded every user interface you tried to employ? What if you were told that you needed to use a different series of letters or characters to identify yourself—not because people couldnʼt understand your name, but because computers couldnʼt?

There are countless examples: name tags reading “Ren√©” instead of “René,” and error messages on forms shouting that the surname “Collette-Ryans” is invalid because a hyphen is invalid in surnames. In more extreme cases, there have been visas issued to people who, due to encoding issues, donʼt exist. Arbitrarily limiting anyone with, say, a single quote in their surname (Nyongʼo, OʼBrien) because you canʼt be bothered to fix it actively repels your users from your system. Not everyone can type even their own name without full access to Unicode. Designing systems accordingly helps ensure that people are able to use your systems.
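Two small sketches may make these failures concrete. The first reproduces the “Ren√©” name tag, assuming the text was written as UTF-8 and then read back as Mac OS Roman (one plausible cause, not necessarily what happened to any particular name tag). The second is a deliberately permissive name check; `mojibake`, `is_plausible_name`, and the `NAME_PUNCTUATION` allow-list are all hypothetical, and many systems would do better to accept almost anything a person types.

```python
import unicodedata


def mojibake(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode the bytes as Mac OS Roman."""
    return text.encode("utf-8").decode("mac_roman")


# Hypothetical allow-list: punctuation that commonly appears in real names.
NAME_PUNCTUATION = {"-", "'", "\u2019", " ", "."}


def is_plausible_name(name: str) -> bool:
    """Accept letters and combining marks from any script, plus common punctuation."""
    name = name.strip()
    if not name:
        return False
    for ch in name:
        category = unicodedata.category(ch)
        # "L*" covers letters in every script; "M*" covers combining marks.
        if category.startswith(("L", "M")) or ch in NAME_PUNCTUATION:
            continue
        return False
    return True


print(mojibake("Ren\u00e9"))
for candidate in ("Collette-Ryans", "O'Brien", "Nyong\u02bco", "Ren\u00e9"):
    print(candidate, is_plausible_name(candidate))
```

The check rejects on general category rather than on an A–Z range, so accented, hyphenated, and non-Latin names pass without special cases.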

One interesting selection factor in the Consortium’s proposal guidelines is factor A: compatibility. We have 🤠 U+1F920 FACE WITH COWBOY HAT as an emoji because it was an emoticon in the popular 1990s instant messenger service Yahoo! Messenger. Indeed, all of the Microsoft Wingdings and Webdings characters are now formally included in Unicode due to this compatibility factor, which explains how such “out of place” emoji as 🕴 U+1F574 MAN IN BUSINESS SUIT LEVITATING are now available.

However, once an emoji is added to Unicode, it takes a while for the major vendors to ship it. For example: Apple (the vendor) needs to update iOS (the operating system) for the iPhone (the device). If your device still receives vendor updates, it often notifies you when they are available, and installing the update gets you, among other things, the latest emoji.

Have you ever wondered why Twitter has the latest and greatest emoji before anyone else? That’s because they use their own emoji designs, called “Twemoji,” and they use these images in place of whatever system emoji a user might have installed. This way every user on their platform can see the same emoji, regardless of the device they’re using. Twitter has also open-sourced some of this in a useful JavaScript package. It’s not a complete solution, as it doesn’t embed the CLDR names database to describe the emoji being shown, but there are other open-source solutions for this edge case. Twitter also has the advantage over Apple and Google here: Updating a website is far easier than updating an operating system.
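The general idea of swapping system emoji for hosted images can be sketched in a few lines. This is not Twitter’s package or its API: the function names, the `example.com` base URL, and the file-naming scheme (lower-case hex code points joined by hyphens) are all assumptions for illustration, and the alt text uses formal Unicode character names from Python’s `unicodedata` module as a stand-in for CLDR names.

```python
import unicodedata


def code_point_slug(emoji: str) -> str:
    """Join the emoji's code points as lower-case hex, e.g. '1f1ef-1f1f5'."""
    return "-".join(f"{ord(ch):x}" for ch in emoji)


def fallback_label(emoji: str) -> str:
    """Build a text description from formal Unicode names (not CLDR names)."""
    names = [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in emoji]
    return ", ".join(names).lower()


def image_tag(emoji: str, base_url: str = "https://example.com/emoji/") -> str:
    """Return an <img> tag that points at a hosted image for the emoji."""
    src = f"{base_url}{code_point_slug(emoji)}.png"
    return f'<img src="{src}" alt="{fallback_label(emoji)}">'


print(image_tag("\u2728"))
print(code_point_slug("\U0001F1EF\U0001F1F5"))  # a flag is two code points
```

A production library also has to find emoji inside running text and handle sequences built with variation selectors and joiners, which this sketch leaves out.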

A force for inclusion and access 🤝

The Unicode Consortium does more than add new species of birds to your phone’s keyboard once in a while. Humans have created so many writing systems that there are still new ones being added into Unicode every year. In the last year alone, Unicode 11.0 saw the addition of Dogra, Gunjala Gondi, Hanifi Rohingya, Makassar, Medefaidrin, Old Sogdian, and Sogdian scripts. The Consortium isn’t able to do this sort of unique work without the support of financial sponsorship, a significant portion of which has been provided by the Adopt a Character program. Individuals and companies are able to give a donation in order to adopt a character—not just emoji—and have their donation listed on a dedicated page. (Warning: large page load.)

Both directly and indirectly, the popularity of emoji is supporting the work the Consortium does to increase the coverage of Unicode to support all written language: work that fosters inclusion and access, and ensures the ability to digitally encode all human writing for generations to come. ✨
