It started in Japan over two decades ago as a way of using leftover space in a character database. Now, it’s exploded into a worldwide phenomenon with such global saturation that multiple styles of poop slippers are available at almost any local market. It’s emoji. And its abundance and popularity are making Unicode more relevant than it’s ever been.
Because computers, we have Unicode: a universal encoding standard that’s designed to have enough space to store every code point of every human written language and then some. Unicode is a superset character encoding standard that incorporates and is backwards compatible with the biggest early encoding sets: ASCII (circa early 1960s), which allows 128 characters and covers most characters you need as an English-speaking American; and “extended ASCII” (a generalization, but circa late 1970s), which allows 256 characters, extending its usability to Western European languages.
However, extended ASCII still left all the characters needed to write in a non-Latin alphabet-based language missing. Alternate encoding systems sprung up to address this, but—as computers started to connect to each other around the globe with the dawn of the internet—disparate encoding systems created more issues than they solved. Computers needed to communicate with each other using a standard encoding that could encompass all the characters any written language would require. And so, in 1991, the Unicode Consortium was founded and Unicode created.
Code points in Unicode are usually written as U+ followed by four to six hexadecimal digits (0–9, then A–F). Here, we’ll be naming specific characters in the following form:
A U+0041 LATIN CAPITAL LETTER A
This represents: the character itself, its Unicode code point, and its Unicode character name (or just name).
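As a rough illustration, the three-part form above can be rebuilt from any single character with Python's standard library. The helper name `describe` is made up for this sketch, and the names it prints depend on the version of the Unicode database bundled with your Python install.

```python
import unicodedata


def describe(char: str) -> str:
    """Return 'character U+XXXX NAME' for a single character (hypothetical helper)."""
    code_point = ord(char)  # the character's numeric code point
    # unicodedata.name() raises ValueError for unnamed characters unless a default is given
    name = unicodedata.name(char, "<no name in this Unicode database>")
    # Code points are conventionally shown with at least four hexadecimal digits
    return f"{char} U+{code_point:04X} {name}"


print(describe("A"))
print(describe("\U0001F920"))  # the cowboy hat emoji
```

Padding to four digits with `04X` still lets longer code points, such as five-digit emoji values, print in full.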
Unicode allows for over a million different code points across multiple blocks on different planes. Planes are major sections—Basic, Supplementary—while blocks allow for semi-logical separation of different types of scripts, keeping characters of the same scripts together where possible. (Because characters get added over time, sometimes there’s no space left in the block, and they have to be added to a different block.) In the end, a code point has a unique identifier that describes it, no matter where on a given plane it resides.
LATIN CAPITAL LETTER A will always be U+0041.
ASCII and ANSI, along with most modern languages, exist in the Basic Multilingual Plane: U+0000 to U+FFFF. Every code point in this plane fits in two bytes (a single 16-bit unit in UTF-16), though in UTF-8 only the ASCII range fits in one byte. Anything beyond this plane is a multibyte character that takes four bytes in both UTF-8 and UTF-16. Most emoji live in the Supplementary Multilingual Plane, from U+10000 to U+1FFFF. Many other scripts exist in blocks on this plane, including hieroglyphics (𓂀 U+13080 EGYPTIAN HIEROGLYPH D010) and cuneiform (𒀀 U+12000 CUNEIFORM SIGN A). There are other planes, but they are sparsely populated, for now.
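To make the storage cost concrete, here is a small sketch that reports which plane a character sits on and how many bytes it occupies in two common encodings. The function names `plane_of` and `encoded_sizes` are invented for this example. The plane number is simply the code point divided by 0x10000, since each plane holds 65,536 code points.

```python
def plane_of(char: str) -> int:
    """Plane number: 0 is the Basic Multilingual Plane, 1 the Supplementary Multilingual Plane."""
    return ord(char) >> 16  # same as integer division by 0x10000


def encoded_sizes(char: str) -> tuple:
    """Return (bytes in UTF-8, bytes in UTF-16) for one character."""
    # "utf-16-le" is used so that no byte-order mark is counted
    return len(char.encode("utf-8")), len(char.encode("utf-16-le"))


# LATIN CAPITAL LETTER A, e with acute, a CJK ideograph, and the parrot emoji
for char in ("A", "\u00e9", "\u5b57", "\U0001F99C"):
    utf8_len, utf16_len = encoded_sizes(char)
    print(f"U+{ord(char):04X} plane={plane_of(char)} utf-8={utf8_len} utf-16={utf16_len}")
```

Only the emoji lands on plane 1, and it is the only one of the four that needs four bytes in both encodings.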
A million characters. A globally recognized standard. Space to encode the alphabet of every written human language.
It sounds magnificent, but most people don’t get to see its gravity. The alphabets that appear earlier in Unicode—the ones that are readily available to most everyone—cover the most popular languages. Alphabets deeper in the specification are used by fewer people. Why would you update the version you use, or the way you use it, when you personally have everything you think you need?
Emoji are inarguably Unicode’s most popular set of characters. For many people, emoji are the sole reason they care that their computers and social media apps and websites support the latest version of Unicode and, in turn, multibyte characters such as emoji. Perhaps this shouldn’t be the case—surely we should care about access and users beyond ourselves—but this is how a non-zero amount of education about the issue happens. If it weren’t for the popularity of emoji, many people wouldn’t care.
With emoji as the driving factor, consumers are much more likely to apply software updates. When “emoji day” happens, now a yearly event during which Apple adds its annual set of 100+ emoji to iOS and macOS, iPhone owners clamor to run their updates so they can send the 🦜 U+1F99C PARROT to their friends. They’re also, perhaps unwittingly, patching their phones with the latest security updates. It’s a remarkable phenomenon. While, for example, an app advertising a new update as having “various bugs and fixes” may not be immediately updated, the new system update with over 100 new emoji has users actively, even eagerly, installing it. Emoji, like the toy at the bottom of a cereal box, are an impressively effective feature for marketing major version updates.
(Emoji day is different from “World Emoji Day,” July 17, as denoted by the date on the 📅 U+1F4C5 CALENDAR emoji on Apple operating systems. The reason why the default date on this emoji is different on different platforms is a story for another time.)
The widespread availability of emoji does yield some complications. For example, on Twitter, verified users are denoted with a blue checkmark next to their name, so Twitter has limited the characters allowed in display names, disallowing any emoji that look like tick marks or blue dots. In the web browser Safari, the padlock emoji is ignored in the tab of a web page so that users donʼt accidentally assume the page is loaded securely.
Emoji have effected real change in platforms. WordPress, which runs at least 20 percent of the world’s websites, made a major update to their systems in 2015 under the guise of “enabling emoji support.” What they actually did was patch a critical security vulnerability that allowed cross-site scripting attacks in some multibyte character situations. In essence (and this is only a tiny exaggeration): a quarter of the internet was saved from hacking by adding emoji support.
Emoji are a completely valid way to drive curiosity about Unicode. Since starting to research and speak and write about how emoji work, I’ve become a Unicode nerd. The mechanics around not just the technological processes of how new emoji are made but also the human processes are truly fascinating. Did you know that you can petition for your own emoji—a full-fledged addition to the universal standard? It’s a long process, but it is open to the public. The Unicode Consortium’s proposal guidelines detail everything—compatibility, distinctness—that will help a proposed emoji’s case, as well as a number of things that won’t—being overly specific, open-ended, or a logo or representation of a brand.
a HyPhEn iS InVaLiD iN SuRnAmEs
As soon as support exists for multibyte characters, every other code point in Unicode has support. (This assumes, of course, that your system has a font that can recognize the character and display it, which isnʼt the case for all systems. The cuneiform examples from earlier in this article, while valid Unicode characters, arenʼt always supported.) Supporting multibyte characters is an imperative feature—not just to indulge our love of emoji, but because it impacts many people on deep and intrinsic levels.
Riddle me this: How would you feel if you werenʼt able to type your own name? If that name—your personal identifier that makes you “you”—confounded every user interface you tried to employ? What if you were told that you needed to use a different series of letters or characters to identify yourself—not because people couldnʼt understand your name, but because computers couldnʼt?
There are countless examples of name tags reading “Ren√©” instead of “René,” and of error messages on forms shouting that the surname “Collette-Ryans” is invalid because a hyphen is invalid in surnames. In more extreme cases, there have been visas issued to people that, due to encoding issues, donʼt exist. Arbitrarily limiting anyone with, say, a single quote in their surname (Nyongʼo, OʼBrien) because you canʼt be bothered to fix it actively repels your users from your system. Not everyone can type even their own name without full access to Unicode. Designing systems accordingly helps ensure that people are able to use your systems.
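The “Ren√©” name tag is what you would get if the UTF-8 bytes for “René” were read back as Mac OS Roman. That is an assumption consistent with the garbled characters, not something the name tag's history confirms. The sketch below reproduces the damage, and then shows one possible, deliberately lenient name check that accepts letters from any script plus a few punctuation marks. Both `garble` and `is_plausible_name` are hypothetical helpers, and the allowed punctuation set is an illustrative choice, not a standard.

```python
import unicodedata


def garble(text: str) -> str:
    """Encode as UTF-8, then wrongly decode as Mac OS Roman (assumed culprit)."""
    return text.encode("utf-8").decode("mac_roman")


def repair(text: str) -> str:
    """Undo garble() by reversing the mistaken decode."""
    return text.encode("mac_roman").decode("utf-8")


# Illustrative policy only: hyphen, straight and curly apostrophes, space, period
ALLOWED_PUNCTUATION = {"-", "'", "\u2019", " ", "."}


def is_plausible_name(name: str) -> bool:
    """Accept any Unicode letters and combining marks, plus the punctuation above."""
    name = name.strip()
    if not name:
        return False
    return all(
        char in ALLOWED_PUNCTUATION or unicodedata.category(char)[0] in ("L", "M")
        for char in name
    )


print(garble("Ren\u00e9"))          # the mangled name tag
print(repair(garble("Ren\u00e9")))  # back to the original
for surname in ("Collette-Ryans", "O'Brien", "Nyong\u02bco"):
    print(surname, is_plausible_name(surname))
```

Checking Unicode categories rather than a hard-coded A to Z range is what lets hyphenated, accented, and non-Latin names through without listing every alphabet by hand.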
An interesting part of these selection factors for inclusion is factor A: compatibility. The reason we have 🤠 U+1F920 FACE WITH COWBOY HAT as an emoji is that it was an emoticon in the popular 1990s instant messenger service Yahoo! Messenger. Indeed, all of the Microsoft Wingdings and Webdings characters are now formally included in Unicode due to this compatibility factor, which explains how such “out of place” emoji as 🕴 U+1F574 MAN IN BUSINESS SUIT LEVITATING are now available.
However, once an emoji is added to Unicode, it takes a while for various major vendors to make this change. For example: Apple (the vendor) needs to update iOS (the operating system) for the iPhone (the device). If your device still receives vendor updates, it often notifies you when they are available, and installing this update gets you, among other things, the latest emoji.
The Unicode Consortium does more than add new species of birds to your phone’s keyboard once in a while. Humans have created so many writing systems that there are still new ones being added into Unicode every year. In the last year alone, Unicode 11.0 saw the addition of Dogra, Gunjala Gondi, Hanifi Rohingya, Makassar, Medefaidrin, Old Sogdian, and Sogdian scripts. The Consortium isn’t able to do this sort of unique work without the support of financial sponsorship, a significant portion of which has been provided by the Adopt a Character program. Individuals and companies are able to give a donation in order to adopt a character—not just emoji—and have their donation listed on a dedicated page. (Warning: large page load.)
Both directly and indirectly, the popularity of emoji is supporting the work the Consortium does to increase the coverage of Unicode to support all written language; work that fosters inclusion and access, and ensures the ability to digitally encode all human writing for generations to come. ✨