The phone that answers back

The man sits down in front of the computer and says, affably: “Computer!” Nothing happens. In a now-hear-this tone, the man repeats: “Computer?” Still nothing happens. Puzzled, he picks up the mouse and speaks into it: “Hello, computer?” Beside him, the impatient owner says: “Just use the keyboard.” The first man replies: “A keyboard?” Then, slightly annoyed: “How quaint.”

The scene comes from the 1986 film, Star Trek IV, in which Scotty, the engineer, and the rest of the crew have flown back in time from the 23rd century.
Scotty needs to get some work done on the computer, and, of course, in the 23rd century they all work by voice command, unlike those 1980s throwbacks.

Yet if the crew were to land 25 years later, in the present day, Scotty would still be just as puzzled by the computer’s lack of responsiveness, unless he picked up one of the latest breed of smartphones. Being able to respond to the human voice has become the new frontier in interaction with these devices.

Apple’s iPhone 4S, which comes with a function called Siri—a “voice-driven assistant”—can take dictation, fix or cancel appointments, send emails, start phone calls, search the web and generally do all those things you might once have employed a secretary to do.

But Siri is not just voice dialling or voice recognition—which tries to turn speech into its text equivalent—it is natural language understanding. Siri grew out of a huge project inside the Pentagon’s Defence Advanced Research Projects Agency, those people who previously gave us the internet and, more recently, a scheme to encourage people to develop driverless cars.

Siri’s parent project, called Cognitive Assistant that Learns and Organises, had $200-million of funding and was the United States’s largest artificial intelligence project. In 2007 it was spun out into a separate business. Apple quietly acquired it in 2010 and incorporated it into its new phone.

When you ask or instruct Siri to do something, it first sends a little audio file of what you said over the air to some Apple servers, which use a voice recognition system from a company called Nuance to turn the speech, in any of a number of languages and dialects, into text. A huge set of Siri servers then processes that text to try to work out what your words actually mean. That is the crucial natural language understanding part, which no one else does on a phone yet.

Natural language understanding
Then an instruction goes back to the phone, telling it to play a song, do a search (using the data search engine Wolfram Alpha), compose an email or a text message, set a reminder or make a phone call. Natural language understanding has been one of the big unsolved computing problems, along with image recognition and “intelligent” machines, for years now, but we are finally reaching a point where machines are powerful enough to understand what we tell them.
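The round trip described above, from audio to text to meaning to an instruction for the phone, can be pictured with a toy sketch. This is not how Siri or Nuance's software is built. Every name here (transcribe, interpret, handle_request, the RULES table) is made up for illustration, the "audio" is assumed to be plain text already, and the "understanding" is nothing more than matching the opening words.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    """What the server sends back for the phone to carry out."""
    action: str
    argument: str


def transcribe(audio: bytes) -> str:
    # Stand-in for the speech-to-text stage. A real recogniser works on
    # sound; here the "audio" is assumed to be UTF-8 text already.
    return audio.decode("utf-8").strip().lower()


# Hypothetical trigger phrases mapped to actions (illustration only).
RULES = [
    ("play ", "play_song"),
    ("remind me to ", "set_reminder"),
    ("call ", "make_call"),
]


def interpret(text: str) -> Instruction:
    # Toy stand-in for natural language understanding: match a leading
    # phrase; anything unmatched falls through to a search.
    for prefix, action in RULES:
        if text.startswith(prefix):
            return Instruction(action, text[len(prefix):])
    return Instruction("search", text)


def handle_request(audio: bytes) -> Instruction:
    # Phone -> transcription -> understanding -> instruction back to phone.
    return interpret(transcribe(audio))
```

Calling handle_request(b"Remind me to buy milk") yields a set_reminder instruction with the argument "buy milk", while a request that matches no rule comes back as a search. The hard part the article describes, working out meaning from context, is exactly what this sketch leaves out.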

The challenge about natural language understanding is that, first, speech-to-text transcription can be tricky—did he just say: “This computer can wreck a nice beach” or “This computer can recognise speech”?—and second, acting on what has been said demands understanding both the context and the wider meaning.

The demonstration that computers have cracked this, just as their relentless improvement cracked draughts and then chess, came in February last year, when IBM’s Watson system competed in the US game show Jeopardy!, in which the quizmaster provides an answer of sorts and the contestant has to come up with the question.

Except the Watson technology did not compete against just anyone. It was ranged against two humans who between them had scores of wins and had won millions of dollars in prize money from the game. They battled it out over “answers” such as “William Wilkinson’s An Account of the Principalities of Wallachia and Moldavia inspired this author’s most famous novel.”

Yes, of course, Bram Stoker’s Dracula. Watson, and the humans, answered correctly, but Watson had more points and won. Watson was not competing on absolutely level terms; it was fed the questions as text at the same time as they were read to the other two players. Still, its victory led to plenty of tweeting from people welcoming the new robotic overlord. And it was not even connected to the internet; it just relied on a database of information stored on its system.

“While competing at Jeopardy! is not the end goal,” its IBM engineers noted dryly, “it is a milestone in demonstrating a capability that no computer today exhibits—the ability to interact with humans in human terms over broad domains of knowledge.” And, of course, in understanding what people mean when they say something.

Having gained that glory, Watson has been shunted off to work on healthcare problems. You can imagine it popping up on some future episode of House, solving some gnarly medical conundrum. But much more interaction like Watson’s is likely because of the growing availability of cloud computing, in which huge amounts of processing power are available ad hoc over the internet. Amazon, for instance, now adds almost as much computing power a day to its cloud-computing systems as it used to need for its entire site in 2000.

Starting to seep in
That means natural language understanding is starting to seep into our daily lives: we have grown used to computers getting better at understanding us in limited contexts—when you phone an automated system and can pay bills just by reading out your payment card number or use booking systems that do not employ humans.

Now it is spreading wider, being used for “semantic analysis” of Facebook and Twitter postings by companies eager to figure out whether people are saying positive or negative things about them online.

Norman Winarsky, who co-ordinated the funding for the Siri company, said it is “real AI [artificial intelligence] with real market use”, adding that “Apple will enable millions of people to interact with machines with natural language. [Siri] will get things done and this is only the tip of the iceberg.”

What is happening now with natural language understanding could just be the beginning of a revolution in how we use computers (particularly smartphones) as big as the one caused by the iPhone in 2007, when suddenly multi-touch screens were de rigueur. It has taken almost five years, but now multi-touch screens are everywhere and even the next version of Microsoft’s Windows, due this year, will offer swooshy touch-screen operations.

Horace Dediu, who runs the consultancy Asymco and formerly worked for the cellphone-maker Nokia, thinks the time is ripe for a new way of interacting with our computers. He singles out how Apple drove changes in interaction before: the mouse and windows in 1984, the iPod’s click wheel in 2001, multi-touch in 2007.

He also noted that Siri has some interesting similarities with what people thought about the original iPhone touch screen: “It’s not good enough; there are many smart people who are disappointed by it; competitors are dismissive; it does not need a traditional, expensive smartphone to run, but it uses a combination of local and cloud computing to solve the user’s problem.”

There has certainly been no shortage of people who say that Siri “isn’t good enough” because it cannot yet deal with every accent or every possible query. They also say that the whole idea of asking a disembodied computer questions is ridiculous. Which is odd when you think about it, because phones are designed for talking into and nobody seems to get embarrassed about announcing to a carriageful of people that they are on the train, or reading their credit-card details out loud for all to note.

The thing about natural language understanding is that people have been expecting it to happen for ages. In 1996 I watched Bill Gates announce that by 2011 we would have computers that could recognise their owners’ faces and voices. And he was right—if you count smartphones as computers, which they are, really.

Even so, it is not clear whether we will talk to our computers in our offices, Scotty-style. Apart from anything, the noise would just be maddening. Then again, that is probably what people thought about introducing telephones into offices. Just hold on for a few years, Scotty. The computer will hear you soon.
