Today I woke up and felt like talking to websites. Instead of clicking a CTA or an 'Add to cart' button, I wanted to be able to say it in words, just like you would interact with Siri, Google Assistant, or Alexa. This technology is called a Voice User Interface (VUI). I used the browser's native SpeechRecognition API (part of the Web Speech API) to implement a basic demo app with JavaScript. In this article, we'll delve into VUIs, their benefits, and how they work, culminating in a simple but functional demonstration of a web-based VUI. So, if you've ever wondered about giving your websites a 'voice', read on!
What are Voice User Interfaces (VUIs)?
Voice User Interfaces, or VUIs for short, are like digital helpers that you talk to. They work by listening to your voice commands and doing what you ask them to do. You can find VUIs in many places these days.
For instance, when you ask your phone for the latest weather update or when you ask your smart speaker to play your favorite song, you're using a VUI. They're also in-car systems where you can control navigation, music, or make calls without taking your hands off the steering wheel.
Some popular examples of VUIs are Amazon's Alexa, Google Assistant, Siri, and Cortana. These are complex VUIs that can do many different things, but VUIs can also be simple, doing just one or two tasks. For example, a voice-controlled light switch in your home is also a VUI.
VUIs in Websites: Why?
I believe VUIs can transform the user experience of websites, and these are my reasons:
Accessibility: VUIs make websites more accessible to a wider audience. They enable people with visual impairments or mobility issues to easily navigate and interact with a website. This is a major step forward in inclusive web design.
Convenience: VUIs offer a hands-free way of browsing a website. For example, users can search for a product, read reviews, and even make a purchase on an e-commerce site without ever having to touch the keyboard or mouse.
Efficiency: VUIs can simplify complex tasks on a website. Instead of clicking through various pages and forms, users can accomplish tasks quickly through voice commands.
Natural Interaction: VUIs make interaction with websites more intuitive. Speaking is a natural way of communication, and VUIs bring this naturalness to website interaction.
Improved User Engagement: VUIs can lead to increased user engagement. They can turn a passive browsing experience into an active conversation, making users feel more connected to the website.
Incorporating VUIs into websites can enhance the user experience by making them more accessible, convenient, efficient, and engaging. And as more users become familiar with voice assistants like Siri and Alexa, the expectation for voice interaction on websites will only grow.
The Components of VUIs
We need to employ a combination of technologies to bring voice interaction to a website. Let's take a closer look at the main components that make Voice User Interfaces (VUIs) possible on the web:
Web Speech Recognition API:
This is the first building block in creating a VUI for a website. The Web Speech Recognition API, native to modern web browsers, is designed to convert spoken language into written text. When a user issues a command such as "search for blue shoes," the Speech Recognition API transcribes this spoken input into written text. It's important to note that while this API is powerful, it isn't perfect and can sometimes struggle with accents, background noise, and complex phrases. However, for simple and clear commands, it serves as a solid starting point for implementing VUIs in a web environment.
This is how it works:
// Create a recognition instance (Chrome still uses the webkit prefix)
const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();

// Keep listening instead of stopping after the first result
recognition.continuous = true;
// Emit results while the user is still speaking
recognition.interimResults = true;

recognition.onresult = function(event) {
  // The most recent result is the last entry in the results list
  let last = event.results.length - 1;
  let transcript = event.results[last][0].transcript;

  if (transcript.includes('button')) {
    // Perform the action associated with the 'button' command
  }
};

// Start listening for speech
recognition.start();
The process starts by creating a new instance of the SpeechRecognition object. With the instance ready, its settings are adjusted for continuous, interim speech recognition. The continuous property is set to true, ensuring that recognition keeps running instead of stopping automatically after the first result. Another property, interimResults, is also set to true. This allows the system to provide results while the user is still speaking, instead of waiting for complete sentences.
Next, an onresult handler is assigned to the recognition object. This function is triggered whenever the system successfully transcribes speech into text. Inside it, the transcribed text, referred to as the transcript, is extracted from the event results.
The transcript is then examined for the presence of a specific command. In this case, the command is 'button'. A related action is triggered if this command is found within the transcript. Although this is a relatively simple example, more sophisticated systems could employ natural language processing to understand the meaning behind the command and use web automation to act on the webpage.
With all the components set, the speech recognition process is kickstarted by calling the start method on the recognition object. This sets the voice user interface in motion, enabling it to listen for and process spoken commands.
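One practical caveat: even with continuous set to true, some browsers (Chrome, for instance) stop recognition after a period of silence. If you want the interface to keep listening, a common pattern, sketched below with an illustrative listening flag, is to restart recognition in the onend handler:

// Illustrative flag so recognition can still be stopped deliberately
let listening = true;

recognition.onend = function() {
  // Some browsers end recognition on silence or timeouts even in
  // continuous mode; restart it to keep the VUI listening.
  if (listening) {
    recognition.start();
  }
};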
Natural Language Processing (NLP):
After converting spoken words into written text using the Speech Recognition API, the next step in building a web VUI is to make sense of the user's command. This is where Natural Language Processing (NLP) comes in. NLP is a field of Artificial Intelligence that enables computers to understand, interpret, and generate human language in a valuable way. It can be used to parse the transcribed speech, identify key commands or intents, and even handle more complex conversational contexts. For instance, if a user says, "Show me the latest blog posts, and then sort them by popularity", an NLP system can break this down into two separate commands: displaying the latest blog posts and sorting the results.
While full-fledged NLP systems might seem complex, many libraries and APIs can help, such as Google Cloud Natural Language API, Microsoft's Azure Language Understanding (LUIS), or open-source libraries like NLTK and spaCy. These tools can greatly simplify the task of language understanding, providing pre-built models for tasks like entity recognition, sentiment analysis, and more.
You can also use Large Language Models (LLMs) like OpenAI's GPT for this step. These models have been trained on vast amounts of text data, enabling them to generate contextually relevant responses based on their input.
In the context of a VUI, LLMs like GPT can be utilized to understand the intent behind the user's command and generate appropriate responses. For example, if a user says, "Show me the latest blog posts and then sort them by popularity," an NLP system powered by an LLM can parse and understand the multiple actions requested in this single command. It can provide relevant responses or ask clarifying questions if the command is ambiguous.
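To make this concrete, here is a minimal sketch of rule-based intent parsing in plain JavaScript. This is not what the demo uses, and the intent names and patterns are purely illustrative; a real system would delegate this step to an NLP library or an LLM:

// A minimal, illustrative intent parser: regular expressions stand in
// for real NLP. Each matched pattern yields an intent the app can act on.
function parseIntents(transcript) {
  const intents = [];
  if (/latest (blog )?posts/i.test(transcript)) {
    intents.push({ action: 'showLatestPosts' });
  }
  if (/sort .*popularity/i.test(transcript)) {
    intents.push({ action: 'sortByPopularity' });
  }
  return intents;
}

parseIntents('Show me the latest blog posts and then sort them by popularity');
// -> [{ action: 'showLatestPosts' }, { action: 'sortByPopularity' }]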
In this basic VUI demo, we're not employing any advanced NLP; we simply check whether the transcribed text includes the word 'button'. This can be seen in the code snippet from the previous section:
recognition.onresult = function(event) {
  let last = event.results.length - 1;
  let transcript = event.results[last][0].transcript;

  if (transcript.includes('button')) {
    // Perform action related to the command
  }
};
The onresult function examines the transcribed text for a specific command, in this case 'button'. If the command is present, an action is triggered. In a more complex system, this is where an NLP library would be used to understand the intent behind the user's command.
It's worth noting that implementing NLP can add significant value to a VUI by enabling it to handle complex commands and provide more natural, conversational interactions. However, it also adds a layer of complexity to the system and may not be necessary for every use case.
Web Automation:
Web Automation is the final step in the process of translating a user's understood command into actionable behavior on a web page. This critical component utilizes tools such as Selenium or Puppeteer to enable programmatic control of web browsers. With the help of web automation, the Voice User Interface (VUI) system can interact with various elements of the web page, replicating human-like behavior. For instance, if the user commands the system to "Click me," web automation tools can be employed to locate the "Click me" button on the web page. Once identified, the system can trigger the button's onClick event, mimicking the action of a user physically clicking the button.
By utilizing web automation, the VUI system can navigate through different web page elements, interact with forms, submit data, scroll, and perform other actions that users typically carry out manually. This allows the VUI to integrate seamlessly with the web application, facilitating a smooth and interactive user experience.
Selenium and Puppeteer are two popular tools offering extensive web automation capabilities. They provide APIs that allow developers to script and control browsers programmatically. These tools enable executing complex actions, handling dynamic elements, and performing asynchronous tasks. Consequently, the VUI system can effectively execute user commands and establish a fluid interaction between voice input and corresponding actions on the web page.
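As a rough illustration, here is a minimal Puppeteer sketch that finds a button by its visible text and clicks it. The URL and button label are placeholders, and a real system would add waiting and error handling:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com'); // placeholder URL

  // Locate the button whose visible text contains "Click me"
  const [button] = await page.$x("//button[contains(., 'Click me')]");
  if (button) {
    await button.click(); // mimic a physical click
  }

  await browser.close();
})();

That said, for a purely client-side VUI like the demo in this article, no automation framework is required: the page's own script can simply call click() on the target element.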
By leveraging the power of web automation, VUIs can automate repetitive tasks, enhance user interactivity, and streamline workflows within web applications. Combining speech recognition, natural language processing, and web automation empowers developers to create sophisticated voice-driven interfaces that elevate the overall user experience.
Challenges in Implementing VUIs on Websites
However, there are limitations and challenges that make implementing VUIs on websites difficult. I outline some of the most relevant issues in this section.
Handling Dynamic Web Elements
Web applications often contain dynamic elements that can change in real-time or based on user interactions. For example, if a user says, "Click the second button," but the web page dynamically reorders the buttons, the VUI system must adapt to these changes and still perform the intended action accurately. Handling dynamic elements requires robust strategies, such as relying on unique identifiers or using DOM traversal techniques to locate elements reliably, as sketched below.
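For instance, here is a minimal sketch that locates a button by its visible text rather than its position, so reordering the page does not break the command (the button label here is illustrative):

// Locate a button by its visible text instead of its index, so the
// command still works if the page reorders its buttons.
function findButtonByText(text) {
  return Array.from(document.querySelectorAll('button'))
    .find(btn => btn.textContent.trim().toLowerCase() === text.toLowerCase());
}

const target = findButtonByText('Click me'); // illustrative label
if (target) {
  target.click();
}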
Ambiguous Commands
Spoken commands can sometimes be ambiguous or lack the context the system needs to clearly understand the user's intent. For instance, if a user says, "Delete it," without specifying what needs to be deleted, the VUI system needs to ask clarifying questions or make educated guesses based on the context. Resolving ambiguity requires advanced natural language processing techniques, context-aware algorithms, and user prompts to ensure accurate interpretation and execution of commands.
Accents and Background Noise
Speech recognition accuracy can be affected by various factors, such as regional accents or background noise. Different accents can pose challenges for speech recognition systems, which may struggle to transcribe words accurately. Additionally, background noise can interfere with speech clarity, leading to misinterpretation of commands. Overcoming these challenges requires robust speech recognition algorithms and techniques, including accent-specific training data and noise reduction algorithms.
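The Web Speech API can't clean up audio for you, but it does report a confidence score with each result, and you can hint the expected language. A minimal sketch, with an arbitrary threshold chosen for illustration:

recognition.lang = 'en-US'; // hint the expected language/locale

recognition.onresult = function(event) {
  const result = event.results[event.results.length - 1][0];

  // Ignore transcripts the recognizer itself is unsure about;
  // 0.6 is an arbitrary threshold chosen for illustration.
  if (result.confidence < 0.6) {
    return;
  }

  // ...proceed with command handling
};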
There are other challenges, such as user adaptation and integration with existing systems. These challenges can be resolved with robust algorithms, considering user-centric design principles, and leveraging advancements in speech recognition and natural language processing technologies.
Demonstration: A Simple VUI with Speech Recognition
I built a simple VUI with speech recognition. Here’s the demo:
This is a simple demonstration of a Voice User Interface (VUI) using speech recognition technology. The application listens to the user's spoken commands and performs actions on the user interface based on those commands.
In this basic example, the application listens for the "Button" command. When this command is recognized, it simulates a click on a button in the interface.
This demonstration showcases several important aspects of VUI design:
Speech recognition: The application listens to the user's spoken commands and accurately transcribes them into text.
Command processing: The application interprets the recognized text and maps it to an action.
Action execution: The application performs the action on the user interface (in this case, clicking a button).
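Putting these three steps together, here is a condensed sketch of the demo's core logic. The button id and the browser-support guard are my own illustrative additions; only the 'button' command comes from the demo itself:

// A condensed, illustrative version of the demo's core logic.
// Assumes the page contains: <button id="demo-button">Click me</button>
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

if (!SpeechRecognition) {
  console.warn('Speech recognition is not supported in this browser.');
} else {
  const recognition = new SpeechRecognition();
  recognition.continuous = true;
  recognition.interimResults = true;

  recognition.onresult = function(event) {
    const last = event.results.length - 1;
    const transcript = event.results[last][0].transcript.toLowerCase();

    // Command processing: map the recognized text to an action
    if (transcript.includes('button')) {
      // Action execution: simulate a click on the button
      document.getElementById('demo-button').click();
    }
  };

  recognition.start();
}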
This demo is a good starting point for anyone interested in developing more complex voice-interactive applications. While simple, it captures the fundamental process of turning speech into action.
Here is the code on GitHub:
Wrapping up…
The potential of VUIs extends beyond accessibility and convenience. They can provide a more seamless and engaging user experience, allowing users to interact with websites and applications more naturally and conversationally. This can lead to increased productivity, reduced cognitive load, and enhanced accessibility for users with disabilities.
As technology progresses and our understanding of human-computer interaction deepens, the future of VUIs looks promising. It is an exciting time to explore, innovate, and shape the future of web-based voice interfaces.