15 December 2016

Mummy’s little helper is growing up

Those who own the voice-activated gadget (known colloquially as Alexa, after its female interlocutor) are prone to proselytising “her” charms, applauding Alexa’s ability to call an Uber, order a pizza or check a pupil’s maths homework. Amazon says more than 5 000 people a day profess their love for Alexa.

On the other hand, Alexa devotees also know that, unless you speak to her … very … clearly … and … slowly, she’s likely to say: “Sorry, I don’t have the answer to that question.”

“I love her. I hate her. I love her,” one customer wrote on Amazon’s website, but still awarded Alexa five stars. “You will very quickly learn how to talk to her in a way that she will understand and it’s not unlike speaking to a small frustrating toddler.”

Voice recognition has come a long way in the past few years. But it’s still not good enough to popularise the technology for everyday use and usher in a new era of human-machine interaction, allowing us to talk with all our gadgets.

Despite advances in speech recognition, most people continue to swipe, tap and click. And probably will for the foreseeable future.

What’s holding back progress? Partly, the artificial intelligence that powers the technology has room to improve. There’s also a serious deficit of data — specifically audio of human voices speaking in many languages, accents and dialects in often noisy circumstances that can defeat the code.

So Amazon, Apple, Microsoft and China’s Baidu have embarked on a world-wide hunt for terabytes of human speech. Microsoft has set up mock apartments in cities around the globe to record volunteers speaking in a home setting.

Every hour, Amazon uploads Alexa queries to a vast digital warehouse. Baidu is busily collecting every dialect in China. The companies then use all that data to teach their computers how to parse, understand and respond to commands and queries.

The challenge is to find a way to capture natural, real-world conversations. Even 95% accuracy isn’t enough, says Adam Coates, who runs Baidu’s artificial intelligence lab in Sunnyvale, California.

“Our goal is to push the error rate down to 1%,” he says. “That’s where you can really trust the device to understand what you’re saying, and that will be transformative.”

Much of the progress owes a debt to the magic of neural networks, a form of artificial intelligence based loosely on the architecture of the human brain.

Neural networks learn without being explicitly programmed but generally require an enormous breadth and diversity of data. The more a speech recognition engine consumes, the better it gets at understanding different voices and the closer it gets to the eventual goal of having a natural conversation in many languages and situations.

Hence the global scramble to capture a multitude of voices. “The more data we shove in our systems, the better it performs,” says Andrew Ng, Baidu’s chief scientist. “This is why speech is such a capital-intensive exercise; not a lot of organisations have this much data.”

When you tell your phone to search for something, play a song or guide you to a destination, the chances are a company is recording it.

When you ask Alexa what the weather is or the latest football score, the gadget uses the queries to improve its understanding of natural language (although “she” isn’t listening to your conversations unless you say her name).

“By design, Alexa gets smarter as you use her,” says Nikko Strom, senior principal scientist for the programme.

Another challenge: teaching voice recognition technology to pick up commands over background noise — the clamour of happy hour, say, or the cacophony of a sports stadium. Microsoft has deployed an Xbox app called Voice Studio to harvest conversation over the din of users shooting villains or watching movies.

The company offered rewards including points and digital apparel for avatars and lured hundreds of subjects willing to contribute their game chatter to Microsoft’s speech efforts.

The program worked like gangbusters in Brazil, where the local subsidiary promoted the app heavily on the main Xbox page. The data was used to create the Brazilian Portuguese version of Cortana, released earlier this year.

Companies are also designing voice recognition systems for specific situations. Microsoft has been testing technology that can answer travellers’ queries without being distracted by the constant barrage of flight announcements at airports.

The company’s technology is also being used in an automated ordering system for McDonald’s drive-thrus. Trained to ignore scratchy audio, screaming kids and “ums”, it can spit out a complicated order, getting even the condiments right.

Amazon is conducting tests in vehicles, challenging Alexa to work well with road noise and open windows.

Ask researchers like Ng when it will be possible to speak naturally to your digital assistant and they get wistful. No one really knows. Neural networks remain mysterious even to those who understand them best.

And much of the work is trial and error; make a tweak here and you’re never quite sure what will happen there.

Based on the current technology and methods, the process will probably take years. But scientists say you never know when a breakthrough will arrive, catapulting research forward and turning Alexa and Siri into true conversationalists. — Bloomberg