New to the iPhone 4S will be a software personal assistant named Siri. Simply by speaking a request in natural language, Siri will perform a task to the best of its ability. This is a big deal. If you don’t believe me, just ask Bill Gates.
No, that wasn’t a typo. Siri may be rolling out on an Apple product, but through the years, Microsoft co-founder Bill Gates has been a persistent advocate of natural user interfaces, including voice and speech.
“Look what was written down from when Paul and I started Microsoft. Half the things we dreamed of as scenarios for software to solve are still in front of us. Natural interface including speech, and the kind of inking that comes out on the tablet.”
For years, the Microsoft chairman has been a fiery advocate, inside the company and out, for the notion that computers should be controlled, not just by mouse and keyboard, but also by more natural means, such as voice, touch and digital ink.
Microsoft may not have been able to make that happen yet, but clearly Gates thinks speech and voice input are a really big deal. But, and this is the important part, speech and voice aren’t just about recognition – they’re about comprehension, and that’s where Siri stands apart.
Natural user interface
When Gates talks about pen, touch and voice, he’s talking about Natural User Interface, interacting with a computer in a manner that is natural and normal for regular people, not just for computer geeks. This is where voice interaction has fallen short over the years.
Speech-to-text has improved tremendously over the years, but that’s primarily transcription, a direct conversion of spoken words to typed words. Improvement in voice control has derived primarily from those speech-to-text improvements. More accurate speech-to-text improves the computer’s ability to recognize spoken commands, but it doesn’t allow a computer to understand anything beyond its fixed list of commands.
Until now, computers have required commands to be spelled out precisely. In order to tell a computer to do something, you needed to speak to it in specific terms the computer could understand. Just like typing commands at a command-line prompt, deviation was not allowed. You had to talk like a computer to talk to your computer. That’s not natural. That’s where Siri promises to be different.
There’s a reason Apple describes Siri as artificial intelligence and not voice control. It’s because Siri was born from SRI International’s Artificial Intelligence Center and an AI project called CALO. According to Siri co-founder Norman Winarsky, its speech recognition component is modular and interchangeable, not an integrated component. First and foremost, Siri is AI, not voice control. 9to5Mac has the full story from Winarsky on what Siri is and how it arose.
This is not to say that Siri will one day become sentient and go Skynet on us. It’s not that kind of AI. Perhaps it would be more precise to refer to it as “language recognition”.
When Google introduced Voice Actions for Android, its tagline was “Just speak it”. Speak what? Well, Google has a list of 9 commands and the contexts that each will recognize. Say “directions to Starbucks” to map a path to the nearest Starbucks, but if you ask “Where’s Starbucks?”, it won’t know what to do. Voice Control on previous versions of the iPhone offers a few more commands but is similarly limited. Both offer voice recognition. Neither has language recognition.
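To make that limitation concrete, here is a minimal Python sketch of a fixed-command matcher. The patterns and the function name `match_fixed_command` are hypothetical, made up for illustration; this is not how Voice Actions or Voice Control is actually implemented, only a toy showing why a rigid command list rejects a perfectly reasonable rephrasing.

```python
import re

# Hypothetical fixed command list: an utterance must fit one pattern exactly.
FIXED_COMMANDS = {
    "directions": re.compile(r"^directions to (.+)$", re.IGNORECASE),
    "call": re.compile(r"^call (.+)$", re.IGNORECASE),
}


def match_fixed_command(utterance):
    """Return (command, argument) if the utterance fits a pattern, else None."""
    # Drop surrounding whitespace and trailing punctuation before matching.
    text = utterance.strip().rstrip("?.!")
    for command, pattern in FIXED_COMMANDS.items():
        match = pattern.match(text)
        if match:
            return command, match.group(1)
    # Anything phrased outside the fixed list is simply not understood.
    return None


print(match_fixed_command("directions to Starbucks"))
print(match_fixed_command("Where's Starbucks?"))
```

The first call fits the "directions to" pattern; the second asks for the same thing in ordinary words and gets nothing back, because the matcher only recognizes the phrasing it was given.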
By contrast, Apple hasn’t posted a list of commands for Siri. The company offers suggestions, like “Tell my wife I’m running late,” “Remind me to call the vet,” and “Any good burger joints around here?”, but there’s no formula to follow, no specific method to remember. The only requirement is that you speak like a normal person. In other words, you don’t just speak to Siri – you speak to it naturally.
Will it work?
Of course, the big question is whether this language recognition system will work as advertised. The demos are impressive, but there’s always a need for skepticism. There’s also the fact that Siri was a standalone app for a short time (before being bought by Apple) and didn’t instantly set the world on fire. Certainly it’s fair to have doubts.
But regardless of how it works out of the box, I believe we can and should rightly look at Siri as a turning point in voice interface. Natural user interfaces, like voice, only work when the interface is natural. Requiring people to speak in fixed terms is not natural. Siri lifts that limitation.
Think about touch input. We had touchscreen phones before the iPhone, so what made this device so different? Multi-touch. Capacitive recognition of fingers, not styluses. Gesture commands. It offered a touchscreen that worked in a manner that was more natural, more intuitive so that even the uninitiated could use it.
Capacitive multi-touch was the breakthrough change that turned touchscreen input into natural input. Siri promises to do the same for voice input. Out of the box, Siri might not be smart enough to recognize dumb commands like “I want nachos”, but it’s on the road to doing so someday. [Edit: Apparently Siri is smart enough to answer dumb questions like this.]
Where that road may go
Siri lives basically as a separate app that ties into other apps. What will be more exciting is if that relationship goes in the other direction. Once Siri gets out of beta testing, it may be opened to developers as part of the iOS interface, just like multi-touch or motion. If that happens, speaking to your iPhone won’t be just about using it as a virtual assistant.
Think about an app like Pocket God where you use touch and motion to interact with the virtual environment in different ways. You touch to grab pygmies and stir up storms. You use motion to turn the world upside down. Now imagine adding voice commands to the mix. Tell those pygmies to dance. Call down a meteor shower. Make them hear your wrath.
Now take that frivolous example and apply it to other apps like Yelp or Facebook. Call up all the five star reviews without searching or sorting. Jump around to different friends’ pages simply by asking. Think about how complex tasks that would take many taps to accomplish could one day be performed just by asking.
Voice control doesn’t have to be limited to a virtual assistant app. It can be a vital part of the interface, just like multi-touch and motion, adding another dimension of control. Everything working together to give the user the best tools for different tasks. That’s what Bill Gates means when he talks about the many methods of natural user interface. It’s not about entering text in different ways. It’s about doing different things, more things, with computing in ways that are natural and intuitive. That’s what this is really about. That’s why Siri is a big deal.