Cognitive capacity. Why a VUI can’t work like a GUI.

October 1, 2019 by Brooks Richey

A pure voice user interface (VUI), like Amazon Alexa, requires different design thinking than developing content for a graphical user interface (GUI). With a VUI, you are not designing for users to navigate and exchange information through the eyes. Rather, you are designing and sharing content and microcopy for the ears.

Specifically, the exchange of information structured within sound, language rules, and thought is being coordinated by both the human brain and the intelligence powering the voice assistant.

In a voice-only interface, the brain and the voice assistant both rely on what is contained in each audio response to get the information they need to complete the desired task.

But a user receiving audio from a voice-only interface only has the bandwidth to manage a limited amount of information at any given time.

The bottleneck is in the brain, and the concept behind it is called cognitive capacity.

Cognitive capacity.

The idea of cognitive capacity applies to users whether they interface with a GUI or a VUI. To understand cognitive capacity, let’s first look at how it plays out differently in each.

A graphical user interface conveys a lot of information to a user. All the images and visual content are presented in a state called permanence. That is, the images, content, and visuals presented in the graphical user interface of a web page or mobile experience are static. Without a prompt, that content doesn’t disappear the way a voice does.

Instead, the visual content experience holds that information like it’s sitting on a shelf. The words on a stop sign or the main menu of a website won’t disappear. You can look away, look back and they will likely be there. Because of permanence, you can look back to remind yourself of the exact content or take the time to look deeper for more information.

That’s not how voice works around cognitive capacity.

A voice response is fleeting. It is gone immediately after being spoken. And if the sentence or sentences are long enough, the exact words are disappearing from the air and from the user’s mind even before the entire statement is spoken.

That’s because your brain can only focus on and process chunks of words (usually around 3 or 4) at a time. So the brain is often letting go of a chunk. While it’s still capturing the gist of that dropped chunk, it is losing its ability to remember the exact words it heard in that chunk.

This unloading frees up cognitive space to allow the brain to focus on the next chunk of words in the sentence. So as the sentence gets longer, the idea expressed in the voice response becomes more abstract in the user’s mind as the exact words and the idea they hold are separated.

That’s cognitive capacity: the total amount of information the brain is capable of retaining at any particular moment.

Cognitive capacity is the secret sauce behind the game of telephone, the game where you whisper a sentence to a friend, who tries to whisper the same sentence to a friend, and so on, and so on…

If you’ve played the game, you know the fun of it is that by the time the last person repeats the message, it’s laughably different than the original message.

How to manage cognitive capacity in voice responses.

1. Don’t expect users to remember a whole bunch of things.

Remember the telephone example? Keep your response short and focused. Try to communicate information in chunks where the user can remember the idea if not the words.

For example, if your voice assistant is reporting on different aspects of the weather, it can say:

“The weather for X is sunny. The temperature is 80 degrees. There is a 40% chance of rain. Would you like more?”

Each sentence completes a thought and can build on the other chunks to complete a picture. And while the user may forget the exact words of the weather response, they retain the bigger concept that satisfies their request (Sunny. Could rain. Hot. I can get more details).
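As a minimal sketch of the weather response above, the reply can be assembled from one short sentence per idea. The function name build_weather_response and the city “Austin” are placeholders invented for illustration, not part of any voice platform’s API.

```python
def build_weather_response(city, condition, temp_f, rain_chance_pct):
    """Return a voice response made of short, single-idea sentences."""
    chunks = [
        f"The weather for {city} is {condition}.",          # one idea: conditions
        f"The temperature is {temp_f} degrees.",            # one idea: temperature
        f"There is a {rain_chance_pct}% chance of rain.",   # one idea: rain
        "Would you like more?",                             # offer detail instead of front-loading it
    ]
    return " ".join(chunks)


# "Austin" is a placeholder city used only for this example.
response = build_weather_response("Austin", "sunny", 80, 40)
print(response)
```

Keeping each idea in its own sentence means extra detail can be added as another short chunk, or held back behind the follow-up question, rather than stretching an existing sentence.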

2. Write for speaking, not writing.

Speaking and writing seem like they would work the same. They don’t.

Because of factors like the permanence of visual communication, and the style and tone that contextualize and pace voice delivery, phrases we are comfortable with in writing can seem odd or jarring when spoken.

3. Use landmarking.

In voice, landmarking gives users a premise or context in which to use the rest of the information that follows. The best example is having the voice assistant front-load the confirmation or answer to your question.

So instead of “Bob, Mike, Jane, and Kim will attend your meeting,” the voice assistant says:

“You have meeting confirmations from Bob, Mike, Jane, and Kim.”

Through landmarking, you set up the premise “this is about meeting confirmations,” which then lets the mind focus more on the details of who’s attending. Without it, the user hears the details first and spends extra mental energy not yet knowing how Bob, Mike, Jane, and Kim are related to each other.
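Here is one way the landmarking pattern might look in code, as a rough sketch. The helper landmarked_list is hypothetical. It simply puts the premise ahead of the list of names so the listener knows what the details are about before hearing them.

```python
def landmarked_list(landmark, names):
    """Build a sentence that states the premise first, then the details."""
    if not names:
        return f"You have no {landmark}."
    if len(names) == 1:
        people = names[0]
    elif len(names) == 2:
        people = f"{names[0]} and {names[1]}"
    else:
        # Join all but the last name with commas, then add "and <last>".
        people = ", ".join(names[:-1]) + f", and {names[-1]}"
    return f"You have {landmark} from {people}."


sentence = landmarked_list("meeting confirmations", ["Bob", "Mike", "Jane", "Kim"])
print(sentence)
```

The same function handles the empty case with the landmark still in front (“You have no meeting confirmations.”), so the user always hears the premise first.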

4. Use vocal punctuation to help users chunk concepts.

The pace, pauses, and emphasis in a voice delivery help the user hear breaks in content and chunk ideas. Things like extra punctuation can give the user extra time to digest a thought.

Even something as small as addressing the user with a comma:

“Brooks,”

That small pause tells me I’m about to get information, which allows me to be ready for it. The alternative is a voice response that starts suddenly, leaving the user to spend cognitive capacity being caught off guard and catching up to the spoken response.

The same goes for periods. Each period lets an idea breathe so your user can digest it.
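On platforms that accept Speech Synthesis Markup Language (SSML), these pauses can also be made explicit. The sketch below is an illustration only: the function name to_ssml and the 300 ms pause length are assumptions, and how a given assistant renders the standard `<break>` and `<s>` elements varies by platform.

```python
from xml.sax.saxutils import escape


def to_ssml(name, sentences, pause_ms=300):
    """Wrap a spoken response in SSML with a pause after the user's name."""
    # Assumed pause length; tune by listening to the actual voice output.
    pause = f'<break time="{pause_ms}ms"/>'
    parts = [f"{escape(name)},{pause}"]
    # Mark each sentence so the synthesizer treats it as its own unit.
    parts += [f"<s>{escape(sentence)}</s>" for sentence in sentences]
    return "<speak>" + " ".join(parts) + "</speak>"


ssml = to_ssml("Brooks", ["The weather is sunny.", "The temperature is 80 degrees."])
print(ssml)
```

Escaping the text keeps characters such as “&” from breaking the markup when names or content come from user data.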

These techniques can help you manage the information load users get when interacting with voice assistants.