Once upon a time Robert Heinlein wrote a book about a car with a voice synthesizer. You talked to it, it talked back. When it started speaking in phrases that the driver had never added, he chalked it up to one of his friends messing around with the database. Eventually it got to the point where he wasn’t sure that the car wasn’t simply sentient.
I think that there’s lots of room for voice applications in today’s user interfaces. Take the GPS in your car. I expect that it talks. You’re not supposed to read the screen while you’re trying to drive, right? So let’s think about how it talks.
One way is to have a good solid database of what it can say, including names of all the streets. That’s very handy, because instead of just “Turn left” it can say “Turn left on Quantum Street.” But the downside is that you’ll be limited in other ways, and there’ll probably be very few ways that the computer phrases certain things. Its strength lies in being able to read you what is effectively a bunch of proper names.
The other approach is more generic: don’t speak the street names at all. The device just says “Turn left” and you rely on the layout of the road to know that you’re taking the left it wanted you to take. Of course there is still an image on the screen, which can have the street name, so it takes half a second to look down and see whether you’re going where you are supposed to.
BUT! Once you substantially drop the size of the database by focusing only on a handful of “Turn left” / “Turn right” types of phrases, now you’ve opened the door to downloading your choice of voice. You could have a celebrity read the 100 or so phrases, zip it up, and there you go.
Even better, you could be given instructions for how to do it yourself. My TomTom has this, but I’ve never taken advantage of it. I do have John Cleese from Monty Python doing my directions, but I’ve never sat down to record the kids and wife doing it (“Turn left here, turn left, here, turn left! You’ll kill us all!”). The instructions were fairly complicated, explaining how you had to get each soundclip exactly the right length for it not to sound broken. Forget that.
What I’d love to see, though, and this gets me back to my original story, is more variety. The TomTom worked by saying “Ok, provide a folder consisting of 59 sound files labelled as follows.” There’s a single sound clip for “end of trip”, so no matter how many times I use it, John Cleese is always there saying “You have reached your destination. You may get out now, but I’m not going to carry your bags.”
What if instead of 59 sound files I was told to provide 59 directories, and in each directory there could be as many files as I wanted? And whenever the device needed to play something it would instead just grab a random file from the appropriate directory?
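To make that concrete, here is a rough Python sketch of the lookup. It is only my guess at how it could work, not anything TomTom actually does: the folder layout (one directory per phrase under a voice folder), the function name `pick_clip`, and the phrase name in the usage note are all made up for illustration.

```python
import random
from pathlib import Path


def pick_clip(voice_root, phrase, rng=random):
    """Return one randomly chosen sound file for the given phrase.

    Assumes a hypothetical layout of <voice_root>/<phrase>/<any files>.
    """
    phrase_dir = Path(voice_root) / phrase
    # Collect every regular file in the phrase's directory.
    clips = sorted(p for p in phrase_dir.iterdir() if p.is_file())
    if not clips:
        raise FileNotFoundError(f"no clips recorded for {phrase!r}")
    # Every clip is equally likely to be played.
    return rng.choice(clips)
```

Something like `pick_clip("voices/cleese", "end_of_trip")` would then hand back a different file from that directory each time, depending on the luck of the draw.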
That would be AWESOME. For the most common ones you could sit down and record a good couple of dozen different ways to say it. Maybe set something up that is user-contributed so you just plain don’t know how many different sounds there are? Or set a priority on them so that most of the time you get “Turn right” but once in a hundred you get “Turn right, moron.”
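The priority idea could be as simple as a weighted random pick. Again, a sketch under my own assumptions: the weights live in a plain dictionary here, and the clip names and the `pick_weighted` function are hypothetical. A real device would need to store the weights somewhere, maybe in a small text file next to the clips.

```python
import random


def pick_weighted(clips, rng=random):
    """Pick a clip name, where `clips` maps each name to a relative weight.

    A higher weight means the clip is played more often.
    """
    names = list(clips)
    weights = [clips[name] for name in names]
    # random.choices returns a list of k picks; we only want one.
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical weights: the rude version comes up about 1 time in 100.
turn_right = {"turn_right.wav": 99, "turn_right_moron.wav": 1}
clip = pick_weighted(turn_right)
```

With weights of 99 and 1, the polite clip plays roughly 99 times out of 100 on average, which is about the ratio I had in mind.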
This approach, of course, would thus work for any device that talks back to you. Your answering machine, maybe?