Context Awareness in Multi-Device Voice Scenarios

Background
I was fortunate enough to work with the Voice and AI UX team. Through this I got the opportunity to explore an area I’ve always been extremely curious about: making conversations with a voice assistant feel more natural.
Outcomes
A framework for an assistant that is more aware of its context of use, helping it react accordingly. I also designed the conversational guidelines for such an assistant without invocation.
Final Presentation

Understanding the Voice Agent

From the very beginning, I knew that I wanted to explore how we can better design voice experiences, to make talking to an AI more natural.

For that I first wanted to start by understanding the technology itself. I wanted to peer into the ‘black box’ and understand how a voice agent works. This resulted in an extensive literature review on the technology behind Text to Speech and Natural Language Understanding.
What makes a voice assistant sound the way it does
How the AI knows what any word means in a sentence

What is 'More Natural'?

I then went on to understand the seemingly simple concept of what it means to have “natural communication”, which raised a lot of interesting questions, such as:
  • Is it natural with regard to how we would expect a human to react?
  • Natural with regard to how users expect something to work? Would a human find a talking toaster to be natural? When does learned behaviour become natural?
  • If a technology is too human-like, does it feel unnatural?
  • Where does one draw the line, without scaring the user or being intrusive?
After a lot of thinking and reading research studies, I came to the conclusion that for voice assistants, being more natural meant being able to engage in and hold conversations the way fellow humans do.
Areas which could make voice experiences more human like
I listed the areas that could be improved to get a more robust view of the opportunities for refinement.

These included placing pauses at the right points between sentences (using pauses for a more emotive response) and mirroring the user’s voice and volume in conversation (you speak faster when someone else speaks faster, since you discern that they are in a hurry, and so on).
Within the scope of my internship, I decided to focus on two of these areas. I chose them for their feasibility within the timeline and their potential impact on the voice experience.

Invocation
Looking at wake words and the different kinds of ways to initiate dialogue and sustain conversation.

(Right now we say ‘Hey Google!’ or ‘Alexa’ every time we want to invoke the voice agent. When communicating with humans, however, we rarely repeat their name; it just feels unnatural.)


Context
Figuring out how the agent can be more aware of the context the user is in.

(When you’re talking to your assistant at midnight, you don’t want the assistant to reply at full volume. How can the assistant be more understanding of the context the user is in?)

Digging Deeper

A lot of work has been done in trying to identify how notifications can be more aware of the context of use. I looked at studies and previous work to gain insights on how a voice agent can be made context-aware as well.

Because being more aware of the user’s context requires knowing more about the user, I also looked at the privacy preferences users had, especially when it comes to sharing their data to get a better experience.
The tension between privacy concerns and contextual interruptions

The Competitive Landscape

Most of today’s experiences focus on command-and-control scenarios or very simple question-and-answer exchanges.

The outcome of starting any 'conversation' is either (a) a statement or (b) an action taken. There’s almost no back-and-forth, nothing learned from the user, no mutual context or trust built over time.

"Alexa, what’s the weather in Seattle on Sunday?"

"Hey Cortana, open Excel."

"OK Google, how tall is Mount Everest?"

Barely a conversation. Yet the term we use is ‘conversational design’.

There have been some innovations in these assistants that make for more natural interactions.
Trends observed in other leading assistants at that time
Insight
After my research, I narrowed down the data I collected to see patterns. This led to a set of key insights that I wanted to keep in mind while trying to make a system more aware of the user’s context.
Proactive Assistance
I found that proactively assisting users at the right point can help complete their intent in a much more efficient way while also avoiding errors that the user might be prone to.
Privacy Preferences
Research suggests that if the system monitors users (unobtrusively and in real time), their degree of comfort usually depends on how helpful the agent is being in the situation.
Interruptions
One critical aspect when providing adequate user assistance is to find the right point to interrupt users. Badly timed interruptions can affect performance and also cause negative user states such as annoyance, frustration and cognitive overload.

Imagining a Future without Invocation

Looking at the current tech trends, somewhere in the future, we will have voice agents that do not always need an invocation and just know when they are being spoken to.

This could be achieved in multiple ways. One is a real-time stream that only recognises utterances meant for the agent, with a model so well trained that it does not need to store the data it receives from the user, which would remove the privacy concerns users have about storage of their audio. Another is a breakthrough in voice tech using technology that is unknown today.

However, a lot of the current guidelines we have will fail for this hypothetical agent.

Multiple new kinds of scenarios and errors will require a fresh approach to design. After a lot of ideation, I came up with the following set of principles, which would define the conversational guidelines.
Design Principles
  • Maximize Efficiency: Agent should allow invocations that do not rely on a wake word
  • Foresee Unexpected Errors: The experience should accommodate new kinds of errors
  • Maintain Transparency: Users should always be able to ask the agent why any action was taken
  • Ensure Clarity: Agent should reaffirm where needed to confirm the user is on the same page
  • Provide Control: The user should feel comfortable and in-control at all times

Creating an Understanding of Context

For such an agent, it is also extremely important to be unobtrusive and aware of the context of use. I looked at some of the different parameters that can be taken into consideration while trying to do so, and came up with a framework which can be used to design the system.

In multi-device scenarios, a combination of these parameters will allow the agent to make better judgements about the user’s context.

Designing the Concept

I’m unable to share the complete conversational design guidelines, since they contain use cases and scenarios for multi-device environments which Samsung may work on in the future.

I can, however, share some example use cases of the context-aware agent, along with snippets of the guidelines.
Device Context
As our voice assistants reach more devices and no longer need invocation, there might be ambiguity as to which device an action should be completed on.

In such cases, being aware of the device currently being used is extremely helpful.
User was using the phone, then kept it on the table.

User then asks assistant to send their flight tickets.

Agent recognizes that the user is now on their laptop, so it sends them there.
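One way to picture this scenario is a small routing rule that sends the action to whichever device was used most recently. This is only a minimal sketch: the `Device` type, the `last_active` timestamp, and the `max_idle` cutoff are all hypothetical names chosen for illustration, not part of the actual guidelines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    name: str
    last_active: float  # hypothetical: seconds on a clock shared by all devices


def pick_target_device(devices: list[Device], now: float,
                       max_idle: float = 120.0) -> Optional[Device]:
    """Return the most recently used device, or None if every device
    has been idle for longer than max_idle seconds."""
    candidates = [d for d in devices if now - d.last_active <= max_idle]
    if not candidates:
        return None  # ambiguous context: the agent should ask, not guess
    return max(candidates, key=lambda d: d.last_active)


devices = [
    Device("phone", last_active=100.0),   # put down on the table a while ago
    Device("laptop", last_active=290.0),  # in use right now
]
target = pick_target_device(devices, now=300.0)
print(target.name if target else "ask the user")
```

Returning `None` when no device was used recently keeps the "Ensure Clarity" principle in view: if the context is unclear, the agent confirms with the user rather than picking a device silently.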
Noise and Activity
A lot of our voice agents reside inside our homes now. However, they don’t always account for situations where we might have guests over, or where there is some noise or activity going on.

The agent can take ambient noise levels into account and know whether there are multiple people in the shared home environment, or just the user or just the family members.

This can be used to understand when to keep sensitive information to itself, or to be more careful before saying out loud things which you might have told the agent in the security of your home.
User has friends over.

User had earlier told the agent to read emails every day at 8pm.

Agent does not read them aloud; instead, it asks the user if it’s still okay to do so, as a notification on their phone.
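The decision in this scenario could be sketched as a simple rule over the room context. Everything here is an assumption made for illustration: the `RoomContext` fields, the 70 dB noise limit, and the three delivery modes are hypothetical, and a real agent would need far more careful sensing.

```python
from dataclasses import dataclass


@dataclass
class RoomContext:
    people_present: int      # hypothetical estimate of people in the room
    known_household: int     # how many of them are recognised household members
    ambient_noise_db: float  # hypothetical ambient noise reading


def choose_delivery(ctx: RoomContext, sensitive: bool,
                    noise_limit_db: float = 70.0) -> str:
    """Decide how to deliver a response given who is around and how loud it is."""
    guests_present = ctx.people_present > ctx.known_household
    if sensitive and guests_present:
        # Do not speak private content in front of guests; check with the user.
        return "ask_on_phone"
    if ctx.ambient_noise_db > noise_limit_db:
        # Too loud to be heard reliably; fall back to a silent notification.
        return "notify_on_phone"
    return "speak_aloud"


# User has friends over when the daily 8pm email readout is due.
print(choose_delivery(RoomContext(4, 1, 55.0), sensitive=True))
```

Asking on the phone, rather than silently skipping the readout, keeps the user in control of a routine they set up themselves.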
Tone
Multiple people can say the same sentence, and each tone of voice will convey different psychological information. It shows you how there’s both verbal and nonverbal meaning in all our words. The nonverbal part is harder to control, so it’s also more genuine.

For example, right after the loss of a loved one, it would be really insensitive for the agent to remind the user to go jogging. In such cases, the agent has an opportunity to be more emotive by understanding the tone of the user and changing its tone and behaviour accordingly.
User gets some bad news.

Agent recognizes the user’s emotional state and assumes the appropriate tone.

Agent acts sensitive to the situation.

Conversational Design Guidelines

Based on the Design principles, these are some snippets of the conversational design for the conceptual voice agent.

Learnings

I’m grateful to have been able to spend the summer exploring some of the exciting new possibilities that voice assistants can bring about. Going through this project made me realize how much untapped potential there is in how smart devices can be used.
Making Guidelines is Hard Work
  • Having designed conversational flows before, I thought I was prepared for designing conversational guidelines. Spoiler alert: I was extremely wrong.

  • I ended up struggling a lot to turn the various use cases I found into actionable guidelines that could be implemented.
  • After a lot of toying with words and help from my mentor, I came up with a set of guidelines to be used.
  • I learnt that it is very important to know when to be generic and when to be extremely specific while approaching conversational guidelines.
Lean into the Ambiguity
  • Working in an environment with such a rapid pace of development also taught me the importance of embracing ambiguity.
  • Sometimes when you explore a completely new field, nothing is pre-defined and you have to learn by making mistakes, and that’s okay!
Let's talk about why conversations are still so hard to get right. Say hi!