Designing for Voice – Is there such a thing as too effortless?

What’s the best control scheme for recording voice?

Is it better to use push to talk, where the user holds a button down for as long as they are recording, or tap to start, where the user hits record to begin and hits stop to finish? This is one of the recurring arguments we’ve had for the last year, and this one simple question has plagued us through multiple versions of the app.
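
To make the difference between the two models concrete, here is a minimal Python sketch of how each one might react to a button press and release. It is only an illustration: the names `Mode`, `Recorder`, `press` and `release` are hypothetical and are not taken from the Verbz app.

```python
from enum import Enum


class Mode(Enum):
    PUSH_TO_TALK = "push_to_talk"  # record only while the button is held
    TAP_TO_START = "tap_to_start"  # one tap starts, the next tap stops


class Recorder:
    """Hypothetical recorder that only tracks whether we are recording."""

    def __init__(self, mode: Mode) -> None:
        self.mode = mode
        self.recording = False

    def press(self) -> None:
        if self.mode is Mode.PUSH_TO_TALK:
            # Holding the button down starts the recording.
            self.recording = True
        else:
            # Each tap toggles between recording and stopped.
            self.recording = not self.recording

    def release(self) -> None:
        if self.mode is Mode.PUSH_TO_TALK:
            # Letting go of the button ends the recording.
            self.recording = False
        # In tap to start, releasing the button changes nothing.


push = Recorder(Mode.PUSH_TO_TALK)
push.press()      # user holds the button and speaks
push.release()    # user lets go, recording stops

tap = Recorder(Mode.TAP_TO_START)
tap.press()
tap.release()     # recording keeps running, hands-free
tap.press()
tap.release()     # second tap stops it
```

In this sketch the only real difference is who ends the recording: in push to talk it ends the moment the user lets go, while in tap to start it keeps running until the user explicitly taps again.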

For us, the best recording model is the one that feels the most natural and effortless for our users. We’ve looked at other apps and platforms for inspiration and guidance; however, even the industry seems fractured. Most digital assistants let users invoke recording through a simple gesture and then speak hands-free, while messaging apps that use voice often require users to hold a button as they speak.

Both of these methods have their pros and cons depending on what you want to do with your voice. Hands-free recording gives users the flexibility to dictate their thoughts unimpeded, but people also have a tendency to ramble. Having users hold a button to record forces them to be clear and concise about what they want to say, but it also adds to their cognitive load.

In the first two versions of Verbz, we went with the push to talk approach, as we were functionally on the messaging side of the voice space (chat and email). Voice recordings were quick and effortless as we had full control of the start and stop. In version three, we explored a tap to start model as we moved towards the digital assistant space. We thought that it would be easier and quicker for users to tap and start talking as they went about their day. However, based on what we learned, we’ll be moving back to the push to talk model in our next release.

So tell me more…and more…and more…

We learned in version three that the tap to start model was a bit too quick and caused anxiety for users trying to record their thoughts. We accidentally made the experience feel hostile to the user. The idea that the device is continuously listening and always ready to turn your voice into data is good on paper; however, when you add humans into the mix, issues start to arise. People don’t think in a continuous stream. Rather, we tend to think in blocks and then verbalize those blocks. Having this entity that was always listening and hungry for more made V3 users anxious that they weren’t thinking and verbalizing fast enough. Moving back to the push to talk model gives our users the opportunity to pause and resume quickly as they gather their thoughts and figure out what needs to be verbalized, without the persistent nagging of the system.

I’m hopeful that version 4 will finally put this argument to rest, but it could also just be another data point in this contentious battle. If you’re interested in seeing how this all works out, join our sign-up list and be one of the first version 4 users to give us feedback.
