If ancient explorers spoke to the stars for guidance, or imagined conversations with mermaid companions to pass the time, then modern incarnations take the form of Google Assistant, Amazon Alexa, or Apple’s Siri. These chatbots are a new kind of user interface: They make complex systems more accessible by leveraging the intuitive nature of language, as opposed to the awkwardness of buttons, pedals, knobs, and switches. Though developer tools and AI technologies have gotten significantly better in recent years, the success of a chatbot depends largely on its execution. Developing a principled conversational interface requires designing for accessibility, supporting internationalization, ensuring intuitive visual cues, and adapting to hardware variability. And that’s just the tip of a mysterious iceberg that has yet to be fully understood. Let’s dive deep (via submarine, of course) to uncover the tremendous complexities that lie beneath the surface.
As we plunge down, we’ll see the wreckage of past ships: Microsoft’s Tay, Facebook’s M, and Tencent’s Baby Q. While some cautionary tales may warn explorers away from these efforts, the bold can salvage data sets, reclaim techniques, and, most importantly, use their lessons to raise morale. It’s time to author bots in ways that go beyond traditional machine learning habits of crunching a large text corpus to squeeze out dodgy dialogue, and steer away from antiquated flowcharts emulating finite state machines. Advanced dialogue systems let developers define high-level objectives for the system to achieve, balancing the power of autonomous reasoning with the interpretability of hand-designed goals.
Chatbots have gone through a boom and bust cycle in the past few years. We’re now on the cusp of another boom, as the authoring pipelines for Facebook, Skype, Slack, Google Assistant, and Amazon Alexa have all started to converge on a common architectural framework. For example, Google’s Dialogflow, Microsoft’s Bot Framework, Amazon’s Alexa Skills Kit, and IBM’s Watson Assistant all provide more or less the same AI modules (which we’ll discuss in detail later). Moreover, it’s easier than ever to deploy chatbots to third-party platforms. As we ramp up to the next boom period, we have an opportunity to capitalize on the industries of personalized tutoring, automated call centers, and smart assistants by using intelligent conversational systems.
If the last decade has taught us anything, it’s that artificial intelligence can’t be owned by any one corporation, contrary to what some had feared; every company seems to have a few tricks up its sleeve. Amid the AI excitement, frontend developers finally have some clarity around the practice of writing chatbots, outlined in the following four pillars:
Dialogue logic: formally writing actions and goals
AI modules: understanding unstructured data through off-the-shelf solutions
Relationship management: storing prior conversations
Kernel: gluing components together in a serverless architecture
Call it DARK for short—an appropriate acronym for our undersea expedition.
The first, and arguably most important, pillar is the dialogue logic, which is a formal description of actions and goals. It’s the principle that allows us to author conversations that last longer than just one turn. Whereas the dialogue logic defines the “rules of chess,” the AI modules fuel the engine that plays the game. Relationship management is the process of cultivating a bond between the chatbot and the user through persistent storage of information between sessions. Lastly, the kernel is the piece of code you write to connect the system to the user.
Pack your life jackets and clean the periscope; we’re about to shine a light on each of these frontend components in the next few sections.
Authoring dialogue that’s many-branched and capable of handling the unexpected is as convoluted as writing software. While a weathered programmer may use a simple text editor to script an algorithm, most software engineers appreciate the benefits of static code analysis, assistive refactoring, and debugging support in an integrated development environment. Similarly, content writers architecting a conversational experience can use all the tools they can get. It takes a tremendous amount of time and iteration to produce dialogue that feels just right, so sinking time and expense into an authoring tool is often a reasonable investment.
The ideal authoring tool enables web developers, content writers, and AI researchers to collaborate and share conversational experiences. However, most dialogue systems serve interactive experiences in their own domain-specific and often proprietary language, creating a fragmented zoo of formats. For example, Google Assistant and Amazon Alexa chatbots don’t share a common intermediate representation, which makes it hard to write content once and make it available to a mass audience. (According to Nasdaq, Amazon represents 36 percent of the global share of smart speakers, followed by Alibaba, Baidu, Google, and Xiaomi, which divide up the remainder in roughly equal segments.) Proprietary formats also carry a lock-in risk: Imagine being locked out of your work due to an expired software license!
With that in mind, developers should avoid tightly coupling the dialogue logic with the system runtime. An open standard for dialogue logic allows you to seamlessly import and export content between platforms, similar to how Scalable Vector Graphics serves as an open standard for vector images. Regardless of the backend services, the dialogue logic can be independently designed, tested, and shared. The W3C Conversational Interfaces Community Group’s Dialogue Manager Programming Language (DMPL) report, which I edited, puts it this way: “An agreed upon representation of dialogue allows content writers to author and share conversational experiences without being distracted by the underlying runtime.”
Two convenient representations for defining dialogue policies draw on JSON and XML. The first, Artificial Intelligence Markup Language (AIML), is an XML-based format. It was originally announced two decades ago; today, it has significant limitations. AIML tangles up natural-language understanding, or NLU, with the dialogue logic. As a result, authoring complicated dialogue systems in AIML is a test of a content writer’s determination to cover edge cases with hand-designed rules, effectively taking the “AI” out of AIML.
On the other hand, DMPL is a JSON representation for dialogue flow and control that doesn’t concern itself with text pattern–matching rules. It avoids conflicting with the NLU, and, as the W3C’s DMPL report states, “relieves content writers from having to write complex pattern-matching expressions.”
Consider a bot for scheduling meetings with clients. This conversation is typically a back-and-forth negotiation to identify mutually agreeable times while optimizing for personal preferences (e.g., don’t schedule meetings early in the morning unless absolutely necessary). The intents that the bot and client may express fall into the following categories: propose a time of day, propose a day of the week, or complete the booking. These are the actions available to the autonomous agent, which are authored formally (in DMPL) so that the dialogue manager (DM) can sequence the actions intelligently.
The DM requires a formal specification of both the available actions, which define what can be said and when, and the goal, which contrasts desired outcomes with undesirable ones. In our scheduling example, the desired situation is that both parties agree on a time without too much back and forth. The DM extrapolates from preferences, ranked by the author, to identify high-utility situations that can be reached by sequencing the available actions.
This dialogue logic, written in a formal language, enables the runtime to plan for an action that maximizes the expected utility by thinking many steps ahead, as though playing a game of chess. The reasoning engine gets us closer to handling long-term dialogue.
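To make the chess analogy concrete, here is a minimal sketch of the scheduling example as a set of actions, a utility function, and a depth-limited lookahead. Everything in it is hypothetical: the names `SlotState`, `Action`, `utility`, and `chooseAction`, the fixed proposals, and the scores are invented for illustration, and none of it is DMPL syntax. It also assumes the client accepts every proposal, so it maximizes plain utility rather than expected utility over uncertain replies.

```typescript
// Hypothetical model of the scheduling example. This is NOT DMPL syntax.
type SlotState = {
  day: string | null;  // proposed day of the week, if any
  time: string | null; // proposed time of day, if any
  booked: boolean;     // whether the booking is complete
  turns: number;       // how many actions have been taken so far
};

type Action = {
  name: string;
  canRun: (s: SlotState) => boolean;  // when the action may be taken
  apply: (s: SlotState) => SlotState; // how the action changes the state
};

const actions: Action[] = [
  {
    name: "proposeDay",
    canRun: (s) => s.day === null,
    apply: (s) => ({ ...s, day: "Tuesday", turns: s.turns + 1 }),
  },
  {
    name: "proposeMorning",
    canRun: (s) => s.time === null,
    apply: (s) => ({ ...s, time: "08:00", turns: s.turns + 1 }),
  },
  {
    name: "proposeAfternoon",
    canRun: (s) => s.time === null,
    apply: (s) => ({ ...s, time: "14:00", turns: s.turns + 1 }),
  },
  {
    name: "completeBooking",
    canRun: (s) => s.day !== null && s.time !== null && !s.booked,
    apply: (s) => ({ ...s, booked: true, turns: s.turns + 1 }),
  },
];

// Author-ranked preferences: a booking is good; early mornings and
// long exchanges are penalized. The numbers are arbitrary.
function utility(s: SlotState): number {
  let score = s.booked ? 10 : 0;
  if (s.time === "08:00") score -= 3;
  return score - s.turns;
}

// Best utility reachable from `s` within `depth` further actions.
function bestValue(s: SlotState, depth: number): number {
  let best = utility(s);
  if (depth <= 0) return best;
  for (const a of actions) {
    if (a.canRun(s)) best = Math.max(best, bestValue(a.apply(s), depth - 1));
  }
  return best;
}

// Pick the action whose outcome leads to the best reachable utility.
// Ties go to the action listed first.
function chooseAction(s: SlotState, depth: number): string | null {
  let bestName: string | null = null;
  let best = -Infinity;
  for (const a of actions) {
    if (!a.canRun(s)) continue;
    const value = bestValue(a.apply(s), depth - 1);
    if (value > best) {
      best = value;
      bestName = a.name;
    }
  }
  return bestName;
}

const start: SlotState = { day: null, time: null, booked: false, turns: 0 };
const firstMove = chooseAction(start, 3);
```

Looking three steps ahead, the planner steers toward an afternoon booking because the morning penalty only shows up in the final state. A one-step lookahead would miss that, which is the argument for planning instead of reacting turn by turn.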
In the next section, we’ll take a step back to see where the DM fits into the overall AI system.
Circa 2012, the influence of machine learning spread under the (false) guise of artificial intelligence. The ability of deep neural networks to detect speech, understand text, make decisions, and classify images benefited from the gold mine of data craftily siphoned and meticulously organized by household-name corporations. Data helps machine learning algorithms fill in gaps, but artificial intelligence promises more than just inductive learning.
Let’s start with one of the most common modules in conversational interfaces: automatic speech recognition (ASR), which gives developers a more useful representation of audio by converting it to text, in part because text processing has more mature tooling than audio processing. When talking into a microphone, a 10-second audio clip may take up around 100 KB of data. Assuming there are on average five letters per word in the English language, 100 KB is the amount of data in roughly 10 times the words in this article. Typically, ASR modules output a natural language text representation of the words spoken in an audio clip. This process is certainly lossy, but the hope is that the information relevant to a conversational experience persists. Modern ASR modules can analyze a person’s tone of voice to recognize emotion and nonlinguistic sounds, such as laughter, but handling sarcasm is still a challenge.
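The size comparison above is easy to check with back-of-the-envelope arithmetic. The sketch below uses the 100 KB and five-letters-per-word figures from the text; the one-byte-per-character encoding and the speaking rate of about 150 words per minute are my own assumptions, added only to show how lossy the conversion is.

```typescript
// Figures taken from the paragraph above.
const clipBytes = 100 * 1000; // "around 100 KB" for a 10-second clip
const lettersPerWord = 5;     // average English word length used in the text

// Assumption: plain ASCII text, one byte per letter plus one byte for
// the space that follows each word.
const bytesPerWord = lettersPerWord + 1;

// How many words of text fit in the space of the audio clip.
const wordsInSameSpace = Math.floor(clipBytes / bytesPerWord); // 16666

// Assumption: a speaking rate of about 150 words per minute.
const wordsSpoken = Math.round((150 / 60) * 10); // 25 words in 10 seconds

// The audio is hundreds of times larger than its own transcript.
const sizeRatio = Math.round(wordsInSameSpace / wordsSpoken); // 667
```

Under these assumptions, the transcript keeps well under 1 percent of the bytes, which is why tone, timing, and background sounds are the first things to go.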
Since ASR converts audio into text, we’ve merely shifted our focus from understanding audio signals to understanding natural language in a text format. Imagine feeding an algorithm all of the sentences in this paragraph—could it process the meaning behind all of the letters and punctuation? That’s where the field of natural language processing (NLP), specifically natural language understanding (NLU), comes in—to translate text into more manageable data structures. Usually, NLU modules bucket text into categories. For example, the utterances “hi,” “hello,” and “hey,” might all be categorized into the intent “greeting.” Longer sentences may contain other useful information, and the process of plucking phrases out of a text is called “entity extraction.” Thanks to NLU, programmers have a way to deal with most of the complexities involved in natural language.
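As a rough illustration of what an NLU module hands back, the toy below buckets an utterance into an intent and plucks out a day-of-week entity. It is a keyword-and-regex stand-in written for this article, not how any production NLU service works; the `understand` function, the intent names, and the keyword lists are all hypothetical.

```typescript
// Shape of the data structure an NLU module might return (hypothetical).
type NluResult = { intent: string; entities: { [name: string]: string } };

// Hypothetical keyword lists. A real NLU service learns from example
// utterances instead of matching literal words.
const intentKeywords: { [intent: string]: string[] } = {
  greeting: ["hi", "hello", "hey"],
  proposeTimeOfDay: ["morning", "afternoon", "evening"],
};

const dayPattern =
  /\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b/;

function understand(utterance: string): NluResult {
  const text = utterance.toLowerCase();
  const words = text.split(/[^a-z]+/).filter((w) => w.length > 0);

  // Intent classification: the first intent with a matching keyword wins.
  let intent = "unknown";
  for (const name of Object.keys(intentKeywords)) {
    const keywords = intentKeywords[name] || [];
    if (words.some((w) => keywords.indexOf(w) !== -1)) {
      intent = name;
      break;
    }
  }

  // Entity extraction: pull the day of the week out of the text, if present.
  const entities: { [name: string]: string } = {};
  const dayMatch = text.match(dayPattern);
  const day = dayMatch ? dayMatch[1] : undefined;
  if (day) entities["day"] = day;

  return { intent, entities };
}

const hello = understand("Hello!");
const proposal = understand("Could we meet Tuesday afternoon?");
```

Here `hello.intent` is `"greeting"`, while `proposal` carries the intent `"proposeTimeOfDay"` along with the entity `day: "tuesday"`. The rest of the system only ever sees these symbols, never the raw sentence.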
The ability to figure out where the chatbot is in a dialogue and what to do next is dictated by the DM—the logic, reasoning, and planning engine of the conversation. The DM manipulates symbols (i.e., intents) and outputs intents for the user. The “greeting” intent, for example, may be sent to the DM, and in return the DM may generate a “greeting” intent back to the user. The module that turns an intent back into a user-friendly natural language text is called natural language generation (NLG). Lastly, a module called text to speech (TTS) synthesizes a voice to speak the text aloud. Modern TTS modules accept text marked up in a format called Speech Synthesis Markup Language (SSML), which allows phrases within an utterance to be spoken with a certain emphasis, at a different pitch, or with a different emotion.
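For a feel of what SSML looks like, the sketch below assembles a small document using the standard `speak`, `emphasis`, and `prosody` elements. The helper functions are hypothetical conveniences, and individual TTS services differ in which elements and attribute values they honor, so treat the output as something to verify against your provider.

```typescript
// Escape characters with special meaning in XML so that arbitrary text
// cannot break the markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap a phrase so it is spoken with strong emphasis.
function emphasize(text: string): string {
  return `<emphasis level="strong">${escapeXml(text)}</emphasis>`;
}

// Wrap a phrase so it is spoken at a different pitch.
function withPitch(text: string, pitch: "low" | "medium" | "high"): string {
  return `<prosody pitch="${pitch}">${escapeXml(text)}</prosody>`;
}

// Join already-marked-up parts into one SSML document.
function speak(parts: string[]): string {
  return `<speak>${parts.join(" ")}</speak>`;
}

const ssml = speak([
  escapeXml("Your meeting is on"),
  emphasize("Tuesday"),
  withPitch("at two o'clock.", "high"),
]);
```

The resulting string is what the kernel would hand to the TTS service in place of plain text.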
No single module is sufficient to make a good chatbot, but proper assembly goes a long way. The code responsible for carefully orchestrating the modules and piping information between them is what we call the kernel. In order to fully appreciate the details of the kernel, we must first explore the concept of relationship management.
Imagine walking into your favorite coffee shop, greeting a barista you recognize, and being asked, “Let me guess—a medium Americano?” Even if the drinks aren’t as good as the service, you’re loyal because of the relationship this experience fosters.
Managing relationships beyond a single session requires both a data structure for representing knowledge and a method of persistently storing it. I encourage you to start simple and optimize later. For example, knowledge may be represented in formats like a property graph, RDF, or Chunks, but the simplest method is a list of key-value pairs (e.g., where “favoriteDrink” is the key, and “Americano” is the value). The same goes for databases: Start simple by using one of the many serverless database solutions, like Firebase or Parse, which allow frontend developers to manage data securely without needing to build and deploy backend code.
A user’s data is part of an overall dialogue context called the information state (IS), which effectively makes the dialogue system stateful. In a DM, these IS values are called variables, just like the variables in traditional programming. The IS variables get passed into the DM so that the dialogue system can take relevant and personalized actions.
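Here is a minimal sketch of the coffee shop scenario, assuming the simplest representation: key-value pairs serialized as JSON. The `fakeDatabase` object is a hypothetical in-memory stand-in for a hosted database, and `chooseGreeting` is an invented placeholder for the decision a DM would make from its IS variables.

```typescript
// The information state: a flat list of key-value pairs.
type InformationState = { [key: string]: string };

// Hypothetical stand-in for a hosted database: one JSON string per user.
const fakeDatabase: { [userId: string]: string } = {};

function saveState(userId: string, state: InformationState): void {
  fakeDatabase[userId] = JSON.stringify(state);
}

function loadState(userId: string): InformationState {
  const stored = fakeDatabase[userId];
  // A user we have never met starts with an empty state.
  return stored === undefined ? {} : (JSON.parse(stored) as InformationState);
}

// The DM reads IS variables to choose a personalized intent.
function chooseGreeting(state: InformationState): string {
  return state["favoriteDrink"] ? "offerUsualOrder" : "askForOrder";
}

// First visit: nothing is known yet, so the bot has to ask.
const firstVisit = loadState("user-42");
const firstIntent = chooseGreeting(firstVisit);
firstVisit["favoriteDrink"] = "Americano";
saveState("user-42", firstVisit);

// A later session: the stored pair changes what the bot says.
const secondVisit = loadState("user-42");
const secondIntent = chooseGreeting(secondVisit);
```

On the first visit the chosen intent is `"askForOrder"`; on the second it is `"offerUsualOrder"`. Swapping the stand-in for a real database should only touch `saveState` and `loadState`, which is the point of starting simple.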
In an operating system, the kernel is the most privileged module: It controls access to memory, CPU, networking, hardware peripherals, and more. Writing a serverless frontend application for juggling audio, text, and symbolic states through a pipeline of AI modules in a real-time system bears uncanny similarities to writing a kernel for an operating system. Similar to the way an OS schedules processes, the AI system schedules what should be said and when it should speak, balancing resources between competing AI modules.
Some low-level duties of the kernel involve managing input devices. Access to the user’s microphone from the browser, for example, unleashes a stream of audio data, represented as a sequence of bytes. Automatic speech recognition APIs (from Google Cloud or IBM Watson, for instance) consume the stream about 20 milliseconds at a time. Most ASR services asynchronously respond with some intermediate results until deciding, through voice activity detection, that the audio is finished, and conclude with a final result.
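The two halves of that exchange can be sketched in a few lines: slicing raw audio into 20-millisecond frames on the way out, and sorting intermediate from final results on the way back. The audio format (16 kHz, 16-bit mono PCM) and the `AsrResult` shape are assumptions made for illustration; check your ASR provider for the formats and fields it actually uses.

```typescript
// Assumptions (not from any particular service): 16 kHz, 16-bit mono PCM.
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2;
const FRAME_MS = 20;

// 16000 samples/s * 2 bytes * 0.02 s = 640 bytes per frame.
const FRAME_BYTES = (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS) / 1000;

// Split a buffer of raw audio into 20 ms frames.
// The last frame may be shorter than the rest.
function toFrames(audio: Uint8Array): Uint8Array[] {
  const frames: Uint8Array[] = [];
  for (let start = 0; start < audio.length; start += FRAME_BYTES) {
    frames.push(audio.subarray(start, start + FRAME_BYTES));
  }
  return frames;
}

// Hypothetical shape of a streaming ASR result.
type AsrResult = { transcript: string; isFinal: boolean };

// Commit final transcripts; keep only the latest intermediate one,
// since each new intermediate result supersedes the previous guess.
function collectTranscripts(results: AsrResult[]): {
  committed: string[];
  pending: string;
} {
  const committed: string[] = [];
  let pending = "";
  for (const r of results) {
    if (r.isFinal) {
      committed.push(r.transcript);
      pending = "";
    } else {
      pending = r.transcript;
    }
  }
  return { committed, pending };
}

const oneSecondOfAudio = toFrames(new Uint8Array(32000)); // 50 frames
```

With these assumptions, one second of audio becomes 50 frames of 640 bytes each.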
Our frontend code (i.e., the kernel) is then also responsible for piping the output of ASR into NLU. Sometimes an intermediate result from ASR is good enough and we can forward that to NLU without waiting for the audio to finish, but the fancier you make it, the more you’ll open yourself up to untested edge cases. There’s no shortage of NLU services for frontend developers: Google’s Dialogflow, Amazon Lex, Microsoft LUIS, and more.
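One way to picture that trade-off is a small pipe that forwards each utterance to NLU exactly once: either at the first intermediate result that looks stable enough, or at the final result. The `stability` score, the `minStability` threshold, and the function names are hypothetical; some ASR services expose a similar signal, but this sketch does not follow any particular API.

```typescript
// Hypothetical shapes for the data flowing through the kernel.
type SpeechResult = { transcript: string; isFinal: boolean; stability: number };
type Understanding = { intent: string };
type NluFunction = (text: string) => Understanding;

// Build a handler that forwards each utterance to NLU at most once.
function makeAsrToNluPipe(
  nlu: NluFunction,
  minStability: number,
  onIntent: (u: Understanding) => void
): (result: SpeechResult) => void {
  let alreadyForwarded = false;

  return (result: SpeechResult): void => {
    const stableEnough = !result.isFinal && result.stability >= minStability;

    if (!alreadyForwarded && (result.isFinal || stableEnough)) {
      onIntent(nlu(result.transcript));
      alreadyForwarded = true;
    }

    // A final result ends the utterance, so reset for the next one.
    if (result.isFinal) alreadyForwarded = false;
  };
}
```

Setting `minStability` above 1 turns the shortcut off, so only final results are forwarded. With a lower threshold, the bot answers sooner but may act on a transcript that the final result later revises, which is exactly the kind of edge case worth testing before you ship it.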
The DM runs continuously in the background (in a web worker thread) from the moment the kernel boots the “operating system.” A task-oriented programming language, such as DM Script, may be used to define the dialogue logic. Intents classified by NLU and the entities extracted from the input utterance are sent to the DM, which handles these incoming events by updating information-state variables. The DM responds to the user at its own pace, at what it determines is the right time to speak, using the policy authored by the content writer.
When the DM is ready to publish an intent to the user, the kernel forwards the intent to the NLG module, which fetches the corresponding user-friendly natural language text. Often, the NLG is a lookup table, implemented on the frontend as a JSON structure, mapping intents to lists of utterances. More complex NLG modules handle filling entities in proper slots. Finally, many TTS services (IBM Watson, Google Cloud, etc.) may be employed to retrieve an output audio stream to speak the text aloud.
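A lookup-table NLG with basic slot filling might look something like the sketch below. The intent names, the utterances, and the `{slot}` placeholder syntax are invented for illustration; the `pick` parameter exists only so that the random choice can be pinned down in tests.

```typescript
// Hypothetical lookup table mapping intents to lists of utterances.
// Curly braces mark slots to be filled with entity values.
const utterancesByIntent: { [intent: string]: string[] } = {
  greeting: ["Hello!", "Hi there!"],
  proposeDay: ["How about {day}?", "Does {day} work for you?"],
  completeBooking: ["Great, you're booked for {day} at {time}."],
};

type Picker = (count: number) => number;

// Default picker: choose one of the available utterances at random.
const randomPick: Picker = (count) => Math.floor(Math.random() * count);

function generate(
  intent: string,
  slots: { [name: string]: string },
  pick: Picker = randomPick
): string {
  const options = utterancesByIntent[intent];
  if (!options || options.length === 0) return ""; // nothing to say

  const template = options[pick(options.length)];

  // Fill each {slot} with its value; leave unknown slots untouched.
  return template.replace(/\{(\w+)\}/g, (whole: string, name: string) => {
    const value = slots[name];
    return value === undefined ? whole : value;
  });
}

const confirmation = generate(
  "completeBooking",
  { day: "Tuesday", time: "2 p.m." },
  () => 0
);
```

Here `confirmation` is "Great, you're booked for Tuesday at 2 p.m." Listing several utterances per intent keeps the bot from sounding like a broken record without adding any machinery.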
The voyage ahead
As AI modules continue to improve, so do the syntax and semantics of languages for defining the behavior of dialogue. DMPL is one such language that formalizes the representation of states (i.e., variables), how they change (i.e., actions), and which situations are preferred (i.e., utility), putting them together to categorize a task. The DM resolves the task by searching for actions that maximize the expected utility, a process reminiscent of traditional search algorithms taught in AI textbooks prior to the recent deep neural networks frenzy. As we look to a new decade, developers should expect—and welcome—an influx of conversational interface designers who specialize in authoring chatbots.