Johan Schalkwyk, voice-recognition expert and Google engineer, raises his Android phone to his mouth. “Lekker melktert resep,” he says.
A moment later the Google page on his smartphone’s browser has pulled up the relevant results. Schalkwyk, a South African who now lives in New York, can get baking and make his mother proud.
Schalkwyk was in Johannesburg this week to demonstrate Google Voice Search. The service is already available in more than a dozen languages. Now the South African version makes it possible to use voice commands to search the web in South African English, Afrikaans or isiZulu.
To nail down the ins and outs of how South Africans speak, Google enlisted the aid of the Meraka Institute at the Council for Scientific and Industrial Research and researchers at the North West University. And, with the help of almost 800 local volunteers who spent their time speaking commands to Google, they got Google’s cutting-edge voice-recognition software to differentiate between the “r” in “butter” and the “r” in “lekker”, between the “x” in “box” and the “x” in “Xhosa”, and to recognise that a traffic light and a robot are in fact the same thing.
Leading pioneer
Schalkwyk, a University of Pretoria alumnus, is one of the world’s leading pioneers in the field of speech recognition and the driving force behind Google’s scarily accurate Voice Search feature.
To begin his demonstration, Schalkwyk asked the gathered reporters to imagine they were about to take an international flight, and needed to know whether the flight would be departing on time.
He held up a smartphone and, speaking into the receiver, said “South African Airways Flight 204.”
A second later, the results appeared on screen. The first result, from flightstats.com, showed that the New York to Johannesburg flight was on schedule.
And it didn’t stop there. “Weather in Johannesburg South Africa” returned a result almost instantly. “Convert 82 degrees Fahrenheit into Celsius” returned 27,7 degrees. Saying “Sloane Street, Bryanston, Johannesburg” called up a Google Map marking the location of Google’s local headquarters.
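For readers who want to check the temperature conversion by hand, the standard formula is C = (F - 32) × 5/9. The short Python sketch below works it through for 82 degrees; the function name is purely illustrative and has nothing to do with how Google computes the answer.

```python
def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius (illustrative helper)."""
    return (deg_f - 32) * 5 / 9


# 82 - 32 = 50, and 50 * 5 / 9 = 27.777...
celsius = fahrenheit_to_celsius(82)
print(f"{celsius:.2f}")  # prints 27.78
```

The exact value is 27.777..., which is 27,7 when cut off after one decimal place and 27,8 when rounded.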
How about some Afrikaans? “Lekker melktert resep,” “boesmanland se biltong” and “prentjies van die Drakensberg” all returned accurate search results.
Even the isiZulu “ixoxo”, with its tongue-twisting click sounds, was recognised.
At this point there was incredulous laughter from the gathering.
“Part of Google’s philosophy is to make searching accessible. It’s very natural, as a human-computer interaction, to use your voice. Using your voice to search is much easier than typing,” said Schalkwyk.
But how does it work?
Tricky thing
“My voice is recorded in real time and streamed over to Google servers anywhere in the world. Voice recognisers will recognise the voice and translate it to text,” he explained.
Voice recognition is a tricky thing. The software that translates a voice search into a text search must take into account different accents, ages of speakers and background noise. What has to be filtered out of a query spoken by an elderly English-speaking man sitting in a quiet room is very different from what has to be filtered out when a young isiZulu-speaking girl speaks in a crowded mall.
“It has to work in these environments, otherwise what’s the point?” said Schalkwyk.
Gartner analyst Nick Jones recently predicted that by 2014 smartphone penetration in South Africa would reach 80%. Add to this the fact that 25% of all searches on Google happen during the day on a cellphone, while on the weekend 65% of searches happen on a cellphone, and it’s easy to see why voice search is such an important venture for Google.
A work in progress
Despite this leap forward in the evolution of human-computer interaction, content itself is still a work in progress. Using voice search only gets you past the first phase of the search. Once the search results appear on screen, you still need to browse through them yourself and physically select the most appropriate option.
Though software engineers are currently working on a system that would facilitate both voice input and voice output, there are still many issues to be worked out regarding the interface.
If, in future, you could not only enter a search query verbally but also receive the results verbally, how would you browse through the many possible options, or select the most appropriate result for further searching? It may take a lot of listening.
Schalkwyk says these are questions that need to be answered in future and points out that he does not believe that having search results returned in text format is necessarily a failing. “We call it a multi-modal interface,” he said.
“Depending on the situation, text output can be better than spoken output. Sometimes touching is better than speaking. But if you’re in your car, you won’t want to do that. You’ll want to use voice. It’s usually a question of using what works best. The key is to make all of the modalities possible.”
Obstacles to voice search
There are other caveats to the technology. For one thing, it only works on smartphones. Phones that run Android 2.1 or later will have the Quick Search Box needed for a voice-powered search pre-installed. iPhone and BlackBerry users must download the Google Mobile App.
The speed with which a search is carried out and a result returned is also dependent on the strength of the WiFi or 3G connection that you are using. Schalkwyk’s demonstration was held up at one point because of the spotty wireless reception in the boardroom where the presentation was held.
But Etienne Barnard, a research professor at the North West University and a chief researcher at the Council for Scientific and Industrial Research’s Meraka Institute, believes that these are obstacles that can be overcome in the near future.
“The phone is not going to be the issue, connectivity is not going to be the issue. Content is the problem,” he said. On the internet, Zulu content is scarce and Afrikaans content is only slightly more abundant.
Barnard believes that one way to improve the depth of local content available on the internet is to create partnerships with government and business, which can support the creation of local databases of information relevant to particular communities, in their own languages.