Implementing limited voice commands - feasibility and usability

30 Nov 2011 - 6:46pm
5 replies
959 reads
DamonV
2010

Hi everyone,

I'm redesigning a software tool for doctors, and am considering including voice commands in the section where doctors need their hands and eyes on the patient, not the keyboard and screen. However, I've never designed for voice commands before, so I'm trying to understand feasibility and usability. Here are some details:

  • Voice commands would be an alternative, optional way to operate select screens.
  • On each of these screens there is a limited number of voice commands the user can give (displayed on-screen).
  • In total, I expect to need no more than a dozen or so single-word commands ("next", "undo", etc.) as well as digits ("one two eight point five"); a sketch of turning such a digit string into a number follows this list.
  • The software is offered to English and German speaking users.
  • The doctor may be 5 feet or so away from the computer microphone.
  • The room in which it's used is mostly quiet.
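
For illustration, here is roughly how such a digit string could be mapped to a number once a recognizer returns it as plain text. This is a minimal Python sketch; the function name and word table are just for illustration, not from any particular toolkit, and the German UI would of course need its own word table.

    # Minimal sketch: turn a recognized digit string such as
    # "one two eight point five" into the number 128.5.
    # Assumes the recognition engine already returns the words as plain text.

    DIGIT_WORDS = {
        "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
        "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
        "point": ".",
    }

    def spoken_digits_to_number(utterance: str) -> float:
        """Map each spoken word to a character and parse the result."""
        chars = []
        for word in utterance.lower().split():
            if word not in DIGIT_WORDS:
                raise ValueError(f"unexpected word: {word!r}")
            chars.append(DIGIT_WORDS[word])
        return float("".join(chars))

    print(spoken_digits_to_number("one two eight point five"))  # 128.5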

What I'd like to know specifically:

  • How much up-front training would be required for the software to recognize the user's voice for this limited vocabulary?
  • During use, how reliable is recognition?
  • How big of a development effort is required to integrate 3rd party voice recognition software for this?
  • Anything I'm overlooking?


Thanks in advance for any help you can offer!
Damon

Comments

30 Nov 2011 - 8:08pm
Adam Korman
2004

I haven't worked with this kind of thing for about 10 years, but (Siri aside) the story hasn't changed much since then. In general, systems that only need to recognize a limited set of commands work reliably and do not require user-specific training, so what you're looking to do should be quite feasible.
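
To make that concrete, here is a minimal sketch of the kind of speaker-independent, limited-vocabulary setup described above. It uses the open-source Vosk engine purely as a stand-in for "any engine that accepts a fixed phrase list"; the model path, command set, and audio settings are assumptions, not recommendations.

    # Minimal sketch: a speaker-independent recognizer restricted to a small
    # command list. Vosk stands in for any engine with a phrase-list/grammar
    # feature; model path, commands, and audio settings are assumptions.
    import json
    import pyaudio
    from vosk import Model, KaldiRecognizer

    COMMANDS = ["next", "back", "undo", "save",
                "zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine", "point"]

    model = Model("model-en-us")                        # pretrained; no per-user training
    rec = KaldiRecognizer(model, 16000, json.dumps(COMMANDS))

    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                     input=True, frames_per_buffer=4000)

    while True:
        data = stream.read(4000, exception_on_overflow=False)
        if rec.AcceptWaveform(data):                    # a final result is ready
            text = json.loads(rec.Result()).get("text", "")
            if text:
                print("heard:", text)                   # hand off to the UI here

The key point is that the engine is told up front exactly which dozen or so words are legal, which is what keeps accuracy high without per-user training.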

Keep in mind that if you have 2-way voice interactions (i.e., if the system will prompt and respond to the user with recorded or synthesized speech), the manner in which the system speaks has a huge impact on the overall perceived experience (irrespective of the actual accuracy of the speech recognition). So, the system's output should match the sophistication of the input it can handle. When a voice system speaks to people using natural language (full sentences, etc.), people respond in kind and tend to have high expectations -- they often speak faster, use longer phrases, test the limits of what the system will understand, and get angry when it fails. When a system uses terse prompts and responses (single words or short phrases), people tend to respond by enunciating simple commands more clearly and are (somewhat) more patient if the system doesn't understand them.

-Adam

 

2 Dec 2011 - 3:18am
Phillip Hunter
2006

I've worked extensively in design in the speech reco field. Here are some direct answers and helpful things to consider.

  • Voice commands would be an alternative, optional way to operate select screens.
  • >>>>> How would the doctor activate voice? By click? By wake-up command? Would this be modal, one way at a time? Can they switch easily? Will they need to?
  • On each of these screens there is a limited number of voice commands the user can give (displayed on-screen).
  • >>>>> I think you indicated that the doctor isn't actually looking at the screen when saying them. If not, will the doctor remember them? Should there be a way for the doc to tell the computer to list the commands out loud? Will the commands clutter the screen or obscure content?
  • In total, I expect to need no more than a dozen or so single-word commands ("next", "undo", etc.) as well as digits ("one two eight point five").
  • >>>>> At a time or total?
  • The software is offered to English and German speaking users.
  • >>>>> Remember that doing this is twice the design effort. Speech is not that easy to localize.
  • The doctor may be 5 feet or so away from the computer microphone.
  • >>>>> This sort of answers the on-screen question, as in the doctors won't easily be able to see the commands. That brings up more issues. Also, this introduces a degraded signal-to-noise ratio.
  • The room in which it's used is mostly quiet.
  • >>>>> But the walls, floor, and windows will likely cause echoes.

What I'd like to know specifically:

  • How much up-front training would be required for the software to recognize the user's voice for this limited vocabulary?
  • >>>>> Probably very little to none depending on which engine you use.
  • During use, how reliable is recognition?
  • >>>>> Speech reco is very reliable under controlled circumstances.
  • How big of a development effort is required to integrate 3rd party voice recognition software for this?
  • >>>>> Medium-sized, but the challenge is finding developers with the experience to do it well; see the sketch after this list for the kind of glue layer involved.
  • Anything I'm overlooking?
  • >>>>> Several things up above. Plus, see if these doctors have used speech reco before, for example for notes dictation. If so, ask them about their usage to get some ideas.
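
On the integration question: in my experience most of the effort goes into the glue between the engine and the application rather than the engine itself. Here is a rough, engine-agnostic Python sketch of that layer; the class and method names are made up for illustration, and the engine is assumed to deliver each final result as text plus a confidence score through a callback.

    # Illustrative glue layer between a third-party recognizer and the app.
    # Names (CommandRouter, on_result) are made up for this sketch; the engine
    # is assumed to report each final result as (text, confidence).
    from typing import Callable, Dict

    class CommandRouter:
        def __init__(self, reject_below: float = 0.6):
            self.reject_below = reject_below            # tune per engine/microphone
            self.active: Dict[str, Callable[[], None]] = {}

        def set_screen(self, commands: Dict[str, Callable[[], None]]) -> None:
            """Swap in the handful of commands valid on the current screen."""
            self.active = commands

        def on_result(self, text: str, confidence: float) -> None:
            """Register this as the engine's result callback."""
            if confidence < self.reject_below:
                return                                  # ignore low-confidence speech
            handler = self.active.get(text.strip().lower())
            if handler:
                handler()

    # Usage: router.set_screen({"next": go_next, "undo": undo_last})

Swapping the active command set per screen also lets you load a matching grammar into the engine, which keeps the recognition space small.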

 

Phillip

2 Dec 2011 - 1:05pm
DamonV
2010

Thanks Phillip, much appreciated! Could you clarify the challenge in localizing for a second language (vs. other types of localization)? Thanks again! Damon


2 Dec 2011 - 2:35pm
LFrancis
2009

Hi,

Just quickly reading some of the comments here makes me think there may be some additional product ideas that could help make this work. How about some kind of hands-free product that allows the doctor to wear the device (I'm assuming it is a smartphone or even a pad of some kind)? The device could be set up to go around their neck, with the capability of folding out so they could see the screen while adjusting it so they could also see what they were doing. It might even be really cool to have a command that would let them take a photo of whatever they are looking at; this could be helpful for further diagnosis, notes, and benchmarking of a person's affliction. I would recommend that the fold-out mechanism be adjustable, but limited so that there would always be enough space to properly record the voice. Of course, if the doctor was using a Bluetooth device, this could help as well, though it may still be good for them to be able to see the screen. Presumably, there will eventually be applications that tie in to hospital records, or even the doctor's own records, which could also be displayed on the phone/pad device and could be very useful as well...

Anyway, I guess my thought is to expand the thinking to the whole experience, which is tied to the hardware as well as the software and to the concept of hands-free operation and voice commands.

best!

Linda

5 Dec 2011 - 8:08pm
Phillip Hunter
2006

@DamonV - The issue with localization is that it just gets trickier when spoken language is involved. In almost any culture, spoken and written language differ from each other; idioms, structure, etc. change between the two forms. It can be hard enough wrapping your mind around that in your own language, and none of us can really even begin to appreciate the colloquial version of a language we don't speak.

Now, since you are offering commands, this issue won't be as large, but on the other hand, you can't expect users to only say the commands while the tool is active, so accounting for non-meaningful words might be the challenge.
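
One way to account for that is to give the engine an explicit path for out-of-grammar speech, so stray talk is absorbed rather than force-matched to a command. A small sketch, using Vosk's "[unk]" token purely as an example of such a garbage model (other engines have their own rejection mechanisms; the command list and filtering rule are assumptions):

    # Sketch: absorb non-command speech instead of force-matching it.
    # Vosk's "[unk]" token is used as the example mechanism; the command list
    # and the filtering rule are assumptions.
    import json
    from vosk import Model, KaldiRecognizer

    COMMANDS = ["next", "back", "undo", "save"]
    grammar = json.dumps(COMMANDS + ["[unk]"])      # out-of-vocabulary words map to "[unk]"

    rec = KaldiRecognizer(Model("model-de"), 16000, grammar)

    def handle_final_result(raw_json: str) -> None:
        text = json.loads(raw_json).get("text", "")
        words = [w for w in text.split() if w != "[unk]"]
        if len(words) == 1 and words[0] in COMMANDS:
            print("command:", words[0])
        # anything else (side conversation, partial matches) is ignored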
