Today's voice search technology can't quite read our minds, but it can learn to understand them. Nearly half of all searches in the US are now voice-activated, and around the globe nearly a third of internet users rely on voice search or virtual assistants to get information, shop, and communicate.
AI speech recognition is becoming ever more prominent in voice assistants, smart home devices, search engines, and the other devices and applications we use to interact with the internet.
But for voice search technology to be useful to every potential user, it needs to be able to recognize and respond to any user.
If you've used voice search, you've likely had the experience of saying something to a smart device only to have it misunderstand you repeatedly. And automatic speech recognition software makes far more errors when the speaker is not white–usually because the algorithm hasn't been trained on enough diverse voice data.
To fix this problem, AI needs human intervention. Developers need to train their speech recognition algorithms and neural networks to identify a wide variety of speech patterns, timbres, accents, disabilities, and languages.
And as voice commerce continues to grow, companies that can deploy accessible, expansive speech recognition technology stand to benefit. For example, 43% of customers use voice search to shop–and when they do, they're primarily looking for information about promotions or planning to make a call. So a voice search technology that creates a smooth experience for the largest number of users can also drive customer engagement.
Types of speech recognition technology
You'll find speech recognition software in a few key places, each of which uses voice data slightly differently.
Voice search
In voice search applications, the software automatically recognizes keywords or phrases and uses them to power information searches. Voice data for searches tends to consist of questions or requests for concrete information: "What is tomorrow's weather?" "Where is the nearest ice cream shop?" "Who won the gold medal in the 2018 Olympics?"
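As a toy illustration of that keyword step (this is a sketch, not any particular engine's pipeline), a transcribed voice query can be stripped down to the content words that actually drive the search:

```python
# Illustrative only: a real engine uses statistical language models,
# not a hand-written stopword list like this one.
STOPWORDS = {"what", "is", "the", "who", "where", "a", "an", "of"}

def extract_keywords(query: str) -> list[str]:
    """Keep only the content words from a transcribed voice query."""
    words = query.lower().rstrip("?").split()
    return [w for w in words if w not in STOPWORDS]

extract_keywords("What is tomorrow's weather?")  # -> ["tomorrow's", "weather"]
```

The surviving terms ("tomorrow's", "weather") are what get matched against the search index.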
Virtual assistants
Virtual assistants listen for voice commands and translate them into actions–as simple as "set a timer for 20 minutes" or as complex as playing a game of trivia with the assistant on your smart device.
These voice services try to imitate natural voice experiences and engagement patterns. Many have names that, when spoken, trigger their processing power–though it's worth noting that the majority of virtual assistants have feminine names, which raises questions about the role of gender stereotypes in software development.
Speech to text
Speech-to-text has a number of uses, all of which convert the spoken word into text that appears on the screen. Users rely on speech-to-text to write text messages, compose emails, perform searches online, and more. Speech-to-text software can also be used to generate digital transcripts from audio recordings or closed-caption an event in real time.
The back end of most speech recognition tech is similar–the engine learns from its inputs to generate better outputs. So, to make sure that a voice recognition engine can understand users around the globe and respond in an appropriate, localized way, these engines need to practice with voice data gathered from speakers with a variety of accents, speech patterns, regional variations, and more.
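One simple piece of that practice is making sure no single accent dominates the training set. A minimal sketch of rebalancing a clip list before training–the clip structure and field names here are invented for the example, not taken from any specific toolkit:

```python
from collections import defaultdict
import random

def balance_by_accent(clips: list[dict], per_accent: int, seed: int = 0) -> list[dict]:
    """Sample at most `per_accent` clips per accent so no group dominates training.

    `clips` is a list of dicts with (hypothetical) keys "path" and "accent".
    """
    by_accent = defaultdict(list)
    for clip in clips:
        by_accent[clip["accent"]].append(clip)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for accent, group in sorted(by_accent.items()):
        balanced.extend(rng.sample(group, min(per_accent, len(group))))
    return balanced

clips = [
    {"path": "a1.wav", "accent": "scottish"},
    {"path": "a2.wav", "accent": "scottish"},
    {"path": "a3.wav", "accent": "scottish"},
    {"path": "b1.wav", "accent": "nigerian"},
    {"path": "c1.wav", "accent": "indian"},
    {"path": "c2.wav", "accent": "indian"},
]
training_set = balance_by_accent(clips, per_accent=2)
# At most two clips per accent make it into the training set.
```

Real pipelines weigh many more dimensions (age, speech patterns, recording conditions), but the principle is the same: deliberately curate the mix rather than take whatever data is easiest to find.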
Putting the human back into speech recognition AI
While a speech recognition engine can improve itself over time, a steady diet of clean data will help speed up the process. How do you source clean speech data for your AI model training? That's something that only a human can do.
The cleanest data comes through human intervention. Using a human-in-the-loop process of supervised data collection will ensure that your machine is being fed a range of data. And human intervention can help correct potential biases in the development and training stages as well.
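In practice, human-in-the-loop pipelines often work by routing the machine's least confident outputs to human reviewers. A minimal sketch, assuming hypothetical (transcript, confidence) pairs from an ASR engine:

```python
def route_transcriptions(
    results: list[tuple[str, float]], threshold: float = 0.85
) -> tuple[list[str], list[str]]:
    """Auto-accept high-confidence transcripts; queue the rest for human review."""
    accepted, needs_review = [], []
    for transcript, confidence in results:
        if confidence >= threshold:
            accepted.append(transcript)
        else:
            needs_review.append(transcript)
    return accepted, needs_review

# Example (transcript, confidence) pairs; values are invented for illustration.
results = [
    ("set a timer for 20 minutes", 0.97),
    ("what is tomorrow's weather", 0.91),
    ("whirr is the neatest ice cream shop", 0.42),  # likely misrecognition
]
accepted, needs_review = route_transcriptions(results)
```

The reviewed corrections then flow back into training, which is where the bias-correcting value of human oversight comes from.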
For example, Venga has worked on projects where our clients needed a large amount of voice data from individuals who had speech disabilities. We recruited speakers to record themselves reading set phrases. Developers were then able to feed the voice data back into the algorithm, giving it clean audio references to match to the meaning so that in the future it would be better at understanding people with similar speaking styles.
If diverse audio data isn't readily available, linguists may be able to translate voice data from one language to another or produce data sets in various regional accents.
And once the data has been collected, digital tools can help transcribe, annotate, create descriptive meta tags and classify audio content so speech recognition engines can more easily crawl through audio and process it.
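As an illustrative sketch of that annotation-and-classification step (the field and tag names are invented for the example), descriptive metadata can be attached to each clip and then indexed by tag so matching audio is easy to retrieve:

```python
from dataclasses import dataclass, field

@dataclass
class AudioAnnotation:
    """Descriptive metadata for one audio clip (field names are illustrative)."""
    path: str
    transcript: str
    language: str
    accent: str
    tags: list[str] = field(default_factory=list)

def index_by_tag(annotations: list[AudioAnnotation]) -> dict[str, list[str]]:
    """Build a tag -> clip-paths index so tagged audio can be pulled quickly."""
    index: dict[str, list[str]] = {}
    for ann in annotations:
        for tag in ann.tags:
            index.setdefault(tag, []).append(ann.path)
    return index

annotations = [
    AudioAnnotation("q1.wav", "what is tomorrow's weather", "en", "us-midwest",
                    ["question", "weather"]),
    AudioAnnotation("q2.wav", "set a timer for 20 minutes", "en", "scottish",
                    ["command", "timer"]),
    AudioAnnotation("q3.wav", "where is the nearest ice cream shop", "en", "indian",
                    ["question", "local-search"]),
]
tag_index = index_by_tag(annotations)
# tag_index["question"] -> ["q1.wav", "q3.wav"]
```

Structured metadata like this is what lets an engine filter its training audio by language, accent, or content type instead of treating the corpus as an undifferentiated pile of recordings.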
With all these tools, training your AI on diverse voice data is handily within reach. The result? Smarter, faster, more inclusive speech recognition.