The idea that computers should deal with voice has been around for a long time. Today, good speech recognition software reaches about 99% word accuracy, so recognizing individual words is no longer a problem for a computer. That accuracy supports a range of useful voice-based applications, including transcription (digital speech-to-text), home automation, medical transcription, mobile telephony, automotive systems, and computer-aided language learning. Yet when it comes to a voice-command user interface for general computer use, the idea fails miserably.
For humans, speech is the most natural and preferred way to communicate in most situations. In some cases, however, we prefer pointing, showing by example, or gesturing. It is easier to point out a person in a crowd than to describe their location in words, just as it is easier to show how to swing a golf club than to deliver an oral explanation of the technique.
When working on a computer, speech is neither natural nor preferred, mostly because of the nature of the work we typically carry out. At best, we can use a combination of voice and pointing with a mouse or a touch screen.
Let’s examine the ideal way of preparing a text document. Dictation would perhaps be the best way to write the text itself. For formatting that text, on the other hand, I would still prefer pointing and marking, then changing colors, fonts, or sizes. How would you even begin to explain the exact color you want for your document title without pointing at it? We can quickly conclude that preparing a text document is actually one of the better cases where voice recognition could be used as part of the interaction. Preparing a spreadsheet, browsing the Internet, or working with photos poses far bigger challenges.

When shopping over the Internet, for example, it is easier and faster to point at the item you want than to describe it, just as it is easier and faster to go to a supermarket and pick things directly from the shelves than to stand in front of a counter and try to describe them to the clerk: “Yes, I want 300 g of this or that salami . . . not this brand, the one to your left … no, that is way too much, remove some of it.” Who has the time and patience to deal with an intermediary when they can do things themselves?

In short, the problem with speech interaction is not the computer’s ability to understand us, but our own nature and the nature of the tasks we perform with computers. People interleave verbal and non-verbal communication and, depending on their character, mood, or the available time, choose which to use, mostly relying on a combination of both.
As Artificial Intelligence makes progress in natural language processing, machine perception, and machine learning, computers are becoming more comparable to humans. One day, no doubt, we will have robotic assistants able to independently complete tasks we assign them. This may include delivering packages within an office building, bringing and passing tools, or cleaning homes and offices. Those robots will complete entire tasks on their own, and we will interact with them primarily through voice. But as long as some kind of desktop or laptop computer is used merely as a tool that helps us do the actual work, it will be easier to select, scroll, mark, drag, or rotate. In other words, to point, in one way or another.
Originally published by Boris Motusic at The Technology Perspective on 14/12/2007