When humans are in a critical situation, they don’t generally turn to a keyboard or mouse to communicate – they just talk (or scream, or shout). Our need to communicate in a digital world is tapping into ambient natural interfaces, and voice is on the cusp of becoming mainstream. But we have all been exasperated by the patchy results when we ask our smart TV to change channel or our car navigation system to recognise the name of our destination – there is still some way to go.


There is an opportunity for organisations to provide more accurate, responsive natural interfaces to their products – going beyond just voice – to increase user convenience and maintain competitiveness. This leads us to consider: what if our digital assistants and machines could supplement verbal communication with context, taking into account gestures, gaze, body language and facial expressions?

Achieving a truly natural and simple human-computer interface would allow us to get more value and control from these new products. This becomes ever more critical as the products and machines around us become more complex and their human users more time-poor.

The half-engaged world

Increasingly, when I set out to take on a DIY task around the home, alongside my screwdriver and electric drill I must have my reading glasses and smartphone. Much of the instruction for a repair is found on websites and YouTube. But it is cumbersome to use a screen in this way, since you need your hands and eyes on the main task. In these circumstances, digital interfaces need to be present, yet take a step back in engagement and not demand our direct interaction with screens. We ask our digital interfaces to be half-engaged. Many manual tasks, be they domestic or vocational, fit this model: a cook, a surgeon, a service engineer or a farmer all need their focus on the main task.

Clearly voice is a powerful way for humans to communicate with digital things. The rise of digital voice assistants makes this very accessible, with the main platform vendors providing sophisticated natural language processing (NLP) engines such as Amazon Alexa, Microsoft LUIS, Google Duplex, IBM Watson and Oracle Digital Assistant. All these platforms are growing in sophistication: the underlying functionality is improving, and designers and developers are maturing in the conversational design skills that make the interaction effective. But in human-to-human interaction we rarely rely on voice alone – we use multiple senses, which was the topic of my last article.

Multi-sensory digital interfaces

While keeping our hands free and our eyes focussed, we can supplement verbal commands with gestures, gaze, body language and facial expressions to give context, sentiment and non-verbal instructions. AI-powered computer vision, among other sensing technologies, gives us the means to read these non-verbal communications and to blend them together into a more complete understanding of what a user is communicating.

With this multi-sensory human interface, we can read deeper into what the human wants. We can tap into their mental and emotional state and understand whether a user is confused (“I don’t know where to put the dishwasher salt”), tense (“I am not sure if it’s the blue or the red wire to cut”) or happy (“that tastes absolutely gorgeous”). We can give our digital assistants a form of emotional artificial intelligence – Dr Sally Epstein explores this further in her blog. By using computer vision, multi-sensory digital assistants can also understand human movement and provide physical guidance on how to conduct a task. The virtual yoga instructor can now give a frustrated student feedback on how to better perform an exercise, and a stressed surgeon could be guided to apply more pressure to an incision.

Multi-sensory conversational design

Extracting the intent and entities from typed or spoken words is a core skill in implementing a natural language dialogue for Alexa or a chatbot. The intent is what the user wants to do (e.g. “order a pizza”) and the entities are attributes of that intent (e.g. “Margherita”, “large”). In multi-sensory conversational design, we extend this approach by bringing in context: non-spoken attributes of the dialogue that enrich the intent. The spoken phrase might be “white coffees please”, shouted across a café, while the user holds up two fingers to indicate they want two cups and points to the table where they are to be delivered.
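To make this concrete, here is a minimal, purely illustrative sketch of how spoken entities and non-verbal context might be merged into a single, completed request. The data structures, field names and fusion rules are assumptions for the example, not a real platform API.

```python
# Hypothetical sketch of multi-sensory intent fusion (illustration only).
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Utterance:
    """The spoken part of the request, as a classic NLP engine would parse it."""
    intent: str                                    # what the user wants to do
    entities: dict = field(default_factory=dict)   # attributes of that intent


@dataclass
class NonVerbalContext:
    """Signals read by cameras and other sensors alongside the speech."""
    finger_count: Optional[int] = None     # e.g. two raised fingers -> quantity
    pointing_target: Optional[str] = None  # e.g. the table being pointed at


def fuse(utterance: Utterance, context: NonVerbalContext) -> dict:
    """Merge spoken entities with non-verbal context to complete the request."""
    request = {"intent": utterance.intent, **utterance.entities}
    if "quantity" not in request and context.finger_count is not None:
        request["quantity"] = context.finger_count
    if "destination" not in request and context.pointing_target is not None:
        request["destination"] = context.pointing_target
    return request


# "White coffees please", shouted across the café while holding up two fingers
# and pointing at table 7.
spoken = Utterance(intent="OrderDrink", entities={"drink": "white coffee"})
seen = NonVerbalContext(finger_count=2, pointing_target="table 7")
print(fuse(spoken, seen))
# {'intent': 'OrderDrink', 'drink': 'white coffee', 'quantity': 2, 'destination': 'table 7'}
```

The key design point is that neither channel is complete on its own: the speech carries the intent, while the vision system supplies the attributes the user never spoke aloud.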

Our Synapse team in San Francisco has recently implemented a perfect example of a multi-sensory domestic solution. By incorporating cameras, sensors and reactive controls into a kitchen hotplate, a cook can use a voice command such as “turn that burner off!” The system uses Amazon’s Alexa Voice Services platform to interpret the speech, and computer vision-based eye tracking to determine which of the four burners the user is addressing. You can see more in the video below.

https://www.youtube.com/watch?v=8F0zYfmOXo4&feature=emb_title
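As a hedged illustration of the fusion involved, the sketch below shows one way a deictic command such as “turn that burner off” could be resolved against a gaze estimate. The burner layout, coordinates and thresholds are invented for the example and do not describe the actual implementation behind the demo.

```python
# Hypothetical sketch: resolving "that burner" from a gaze estimate (illustration only).
from math import dist

# Assumed positions of the four burners on the hob surface (x, y in centimetres).
BURNERS = {
    "front left": (10, 10),
    "front right": (40, 10),
    "back left": (10, 40),
    "back right": (40, 40),
}


def resolve_burner(gaze_point, max_distance=15.0):
    """Return the burner nearest to where the user is looking, if close enough."""
    name, position = min(BURNERS.items(), key=lambda item: dist(gaze_point, item[1]))
    if dist(gaze_point, position) > max_distance:
        return None  # gaze is nowhere near a burner, so ask the user to clarify
    return name


def handle_command(intent, gaze_point):
    """Combine the spoken intent with the gaze-resolved target."""
    if intent == "TurnBurnerOff":
        burner = resolve_burner(gaze_point)
        if burner is None:
            return "Which burner would you like me to turn off?"
        return f"Turning off the {burner} burner."
    return "Sorry, I didn't understand that."


# The NLP engine has interpreted "turn that burner off"; eye tracking estimates
# the user is looking at roughly (38, 12) on the hob.
print(handle_command("TurnBurnerOff", (38, 12)))  # -> "Turning off the front right burner."
```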

One of the notorious challenges of conversational voice design is dealing with user navigation. With a voice-only interface, the user has no sense of where they are in the dialogue, so good voice design requires spoken responses to be long-winded to keep the user in step with the dialogue state. For example, a question such as “how much flour should I add?” is answered with “for the light pancake mix, please add 350g of plain flour”. With an integrated multi-sensory environment, the state or navigation can instead be conveyed back to the user by changing the state or position of an object. In the video above, the user can physically see the setting of the hotplate, and hence an instruction to “turn it down more” makes sense.
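The sketch below illustrates this design choice in the simplest possible terms: a hypothetical response generator that keeps spoken replies terse when the device itself can show the dialogue state, and falls back to long-winded confirmations when it cannot. The function and parameter names are assumptions for illustration only.

```python
# Hypothetical sketch: adjust spoken verbosity to the available physical feedback.

def respond(step_name: str, ingredient: str, amount: str, device_shows_state: bool) -> str:
    """Generate a spoken reply to 'how much ... should I add?'."""
    if device_shows_state:
        # The user can see where they are (a recipe display, a visible dial),
        # so a terse answer keeps the conversation natural.
        return f"Add {amount} of {ingredient}."
    # Voice-only: restate the context so the user stays in step with the dialogue.
    return f"For the {step_name}, please add {amount} of {ingredient}."


print(respond("light pancake mix", "plain flour", "350g", device_shows_state=False))
# "For the light pancake mix, please add 350g of plain flour."
print(respond("light pancake mix", "plain flour", "350g", device_shows_state=True))
# "Add 350g of plain flour."
```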

Another form of context is background knowledge of a user’s mental model of a conversational topic. For example, one of the important lessons we have learned from our experience of voice in product augmentation is to use the language of the user rather than sticking strictly to the formalised language of your organisation. The correct product name for a toothpaste might be “Dental Whitener Extreme, 50ml”, but your customer might know it as “the pink striped tube”. Having both an extended learning data set based on real user input and the ability to augment the vocabulary of the NLP engine is crucial to making the conversation natural – a delight rather than an irritation and inconvenience.
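As an illustration, the snippet below builds a simple synonym index that maps the phrases customers actually use onto the formal product catalogue; on a platform such as Alexa, a similar effect is typically achieved with slot value synonyms and entity resolution. The product names and mappings here are invented for the example.

```python
# Hypothetical sketch: vocabulary augmentation via a synonym index (illustration only).

CANONICAL_PRODUCTS = {
    "Dental Whitener Extreme, 50ml": [
        "dental whitener extreme",
        "the pink striped tube",
        "pink toothpaste",
    ],
}

# Invert the catalogue into a lookup keyed on the phrases users actually say.
SYNONYM_INDEX = {
    synonym.lower(): canonical
    for canonical, synonyms in CANONICAL_PRODUCTS.items()
    for synonym in synonyms
}


def resolve_product(user_phrase: str):
    """Map what the customer said to the formal product name, if we know it."""
    return SYNONYM_INDEX.get(user_phrase.lower().strip())


print(resolve_product("the pink striped tube"))  # -> "Dental Whitener Extreme, 50ml"
print(resolve_product("blue gel"))               # -> None: a candidate for the learning data set
```

Phrases that fail to resolve are exactly the ones worth feeding back into the learning data set, so the vocabulary keeps growing from real user input.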

Voice everywhere

We have recently become an Amazon Alexa Voice Services Consulting and Professional Services Partner and we are excited about the range of Alexa technology now available. While Alexa-based smart speakers like the Echo Dot are now prevalent in many homes, the software development kit (SDK) means that the technology can also be embedded into many other physical form factors, and we are excited by the possibility of applying the multi-sensory human interface approach in domestic and industrial settings alike.

If you have a product or service for which you would like to provide an enhanced user experience in the half-engagement model, or you simply want help in making a better voice interface, we can help you with your strategy.

Author
Martin Cookson
Director of Service Innovation

Martin works with clients to help realise their ambitions in developing digital services encompassing AI, UX, cloud computing and network-native technologies. He has more than 30 years’ experience working in innovation with international clients, applying new technologies to bring new services live.