Depending on where you’ve lived you’re probably using at least one of WhatsApp, WeChat, LINE, or Kakaotalk. At first glance, these apps look like communication tools. They allow you talk to your friends, family, coworkers and business partners. Sometimes all at once. But messenger apps are just the beginning of a much larger trend. Pretty soon, you’ll use such apps not to talk to your friends, but to talk to machines. Chat applications will become a new interface used to consume information and services. In fact, many companies have started to move into this direction. We’ll get to that later, but bear with me for a little while.
Let me clarify what I mean by interface first. For the purpose of this post, an interface is a layer of technology that facilitates the exchange of information and services between humans and machines. The perhaps most ubiquitous interface today is the web browser. We’ve invented several layers of complex technologies: HTTP, HTML, CSS, Javascript, and many others, to make the web browser interface work well. The same is true for mobile apps, which share a lot of the same principles. Some people argue that a browsers and mobile apps are fundamentally different, but I disagree. There are subtle differences, but you’re essentially looking at a screen and clicking or touching stuff. They’re quite similar. In both cases, we use them to communicate with other people (chat apps), find information (Google), and consume services (Amazon).
By design, all interfaces impose constraints on their users. Some constraints are good – they force us to focus on the essential, making us more efficient. But Some are bottlenecks to our productivity, preventing us from doing what feels natural to us. Take search as an example. Humans are quite good at asking questions. We’ve been doing that for thousands of years. But can you have a conversation with Google to get the information you need? No, you can’t. (Okay, except for a tiny fraction of very simple and commonly asked questions). Instead, we’ve adjusted to the interface that Google is providing us. We’re entering keywords that we think Google will be able to efficiently utilize to give us what we want. Most of us are doing this unconsciously already because we’ve been conditioned to do so. But if watch your mom, who didn’t grow up with Google, you’ll see that she has a hard time wrapping her head around what “good” keywords are, and why she can’t find what she needs. She also can’t tell spam for non-spam. Keyword search doesn’t come naturally to her.
Now, there’s an alternative interface that we use to communicate, get information and consume services almost every day: Natural Language. Natural Language is nothing more than the transmission of information, or meaning, through a medium such as sound or script. That’s also why I won’t make a distinction between speech and text-based solution in this post. They both have their place, but they embody the same thing. Language is an interface that, right now, mostly facilitates communication between humans. But you could imagine using this same interface to facilitate communication between humans and machines. Instead of putting keywords into Google you’d ask a question, potentially have a quick conversation, and get a response. Instead of clicking through a dozen app screens you’d just say what food you want to order, or tell your car where you want to go.
Just like other interfaces, the natural language interface has constraints. It isn’t suited for everything. Obviously it’s bad for visual tasks – shopping clothes for example. It also doesn’t seem so great for discovery, like browsing items on Amazon without a clear goal. But you can certainly imagine use cases where natural language excels. Asking questions to get information or conveying tasks with a clear intention – ordering a specific food item, cab, or giving directions for example. Natural Language places a low cognitive load on our brains because we’re so used to it. Hence, it feels more natural and effortless than using an app or browser.
Of course, Facebook (M), Google (Now), Apple (Siri), Microsoft (Cortana) and Amazon (Echo) have been working on natural languages interface for a while now. They’re calling it personal assistants (PAs). However, there’s another group of companies who are attacking the same problem from a different angle, but we’ve been ignoring that part of their business. Messenger apps. These platforms are uniquely positioned to enable communication with machines. The leader of the pack is probably WeChat, which allows you to order everything from food to taxis from within the app, using a messaging as the interface. Slack is moving into the same direction by encouraging the creation of various bots. LINE also has a range of robotic accounts, ranging from translation services to erotic entertainment.
Why do I think these two groups of companies are working on the same problem? PAs follow what I would call a top-down approach. They’re trying to solve a challenging problem with a long history in AI: Understanding your intentions and acting on them. They are trying to build general purpose software that is smart enough to also fulfill specific tasks. It’s a research area with lots of stuff left to figure out. Messenger apps follow a bottom-up approach. They are starting with simple bots (“Hi, translate this sentence for me”), solving very specific problems, and are gradually moving towards more sophisticated AI. Over time, these two approaches will converge.
When it comes to the adoption of natural language interfaces, messengers apps may actually have a leg up. Here’s why. Talking to Siri on a crowded train still feels awkward, even in SF. It’s also annoying because you have to open yet another app. However, many of us are spending our time in chat apps anyway, so it feels completely natural to just add conversation with a bot. Ditto for Slack. The transition for consumers seems so much smoother. As conversational interfaces penetrate our world, voice interfaces will start feeling just as natural as taking selfies (and hopefully more than selfie sticks). But another reason for why messenger apps are well positioned is data. These companies have collected huge amounts of conversational data that can be used to train better Natural Language models. Facebook has that too, both with their messenger and with WhatsApp. Microsoft doesn’t. Amazon doesn’t. Apple may or may not, depending on how secure iMessage really is.
If you’ve actually used any of the PAs you may be skeptical. Siri still barely understands what you want, and Facebook has put hordes of human workers behind M to get it to do anything useful. How will these things ever replace all the complex tasks we’re doing apps and browsers? That’s another reason for why the bottom-up approach seems promising. It yields immediate benefits and practical applications.
But I also believe that over the coming years we will see rapid improvements in conversational interfaces. There are clear enabling technologies for this trend: Natural Language Processing (NLP) and Deep Learning. Deep Learning techniques have led to breakthroughs in Computer Vision, and they are now penetrating natural language research. Whether or not Deep Learning itself will lead to breakthroughs in NLP is hard to say, but one thing is clearly happening: Many smart people who didn’t previously focus on NLP are now seeing the potential and are starting working on NLP problems, and we can surely expect something to come out of this.