Setting Up Automatic Speech Recognition on Ubuntu
This guide walks through setting up an offline Automatic Speech Recognition (ASR) system on Ubuntu with Vosk, a lightweight open-source speech recognition toolkit with Python bindings.
1. Install System Prerequisites
First, make sure your Ubuntu system is up to date and has Python, pip, and unzip installed. Open a terminal and run the following commands:
sudo apt update
sudo apt upgrade
sudo apt install python3 python3-pip git unzip
These commands refresh the package index, upgrade installed packages, and install Python, pip, Git, and unzip (needed to extract the model in step 4).
2. Install Vosk API
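Vosk supports many languages, and its Python package is installed with pip. On Ubuntu 23.04 and later, pip refuses to install packages into the system Python (it reports an externally-managed-environment error), so create and activate a virtual environment first and run every pip3 and python3 command in this guide inside it. On older releases this step is optional:
sudo apt install python3-venv
python3 -m venv ~/vosk-env
source ~/vosk-env/bin/activate
Then install Vosk:
pip3 install vosk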
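Before moving on, it can help to confirm that the packages are importable by the same interpreter you will use to run the script (a common problem when a virtual environment is not activated). The following is a minimal sketch using only the standard library's importlib.util.find_spec; the helper name is_installed is hypothetical and not part of Vosk.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found by this Python interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Report which interpreter is running and whether it can see the packages this guide uses.
    print(f"Interpreter: {sys.executable}")
    for name in ("vosk", "pyaudio"):
        print(f"{name}: {'found' if is_installed(name) else 'MISSING'}")
```

If vosk shows as MISSING even though pip3 reported success, the package was most likely installed for a different interpreter or outside the active virtual environment.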
3. Install Additional Dependencies
Install the ALSA utilities (which provide arecord, used to list and test recording devices) and SoX (for converting audio files) together with its extra format handlers:
sudo apt install alsa-utils sox libsox-fmt-all
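These tools matter mostly when you feed audio files to Vosk instead of a live microphone: the recognizer in this guide is set up for mono, 16-bit PCM audio at the sample rate passed to KaldiRecognizer (16000 Hz here). The sketch below, which assumes those settings, checks a WAV file with Python's standard wave module; the function name check_wav_for_vosk is hypothetical.

```python
import wave
from typing import List


def check_wav_for_vosk(path: str, expected_rate: int = 16000) -> List[str]:
    """Return a list of format problems; an empty list means the file looks usable."""
    problems = []
    with wave.open(path, "rb") as wf:
        # Mono input, matching channels=1 in the recognition script
        if wf.getnchannels() != 1:
            problems.append(f"expected mono audio, got {wf.getnchannels()} channels")
        # 16-bit samples (2 bytes each), matching pyaudio.paInt16 in the script
        if wf.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {8 * wf.getsampwidth()}-bit")
        # The sample rate must equal the rate passed to KaldiRecognizer
        if wf.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, got {wf.getframerate()} Hz")
    return problems
```

If the list is not empty, SoX can convert the file, for example with sox input.wav -r 16000 -c 1 -b 16 output.wav (output rate, channel count, and bit depth are set before the output file name).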
4. Download a Vosk Model
Vosk provides pre-trained models for many languages, listed at https://alphacephei.com/vosk/models. Download one that matches your language. For example, to download the US English model (a download of roughly 1.8 GB), run the following commands:
# Create a directory for models
mkdir -p ~/vosk-models
cd ~/vosk-models
# Download the Vosk English model
wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip
# Unzip the downloaded model
unzip vosk-model-en-us-0.22.zip
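After unzipping, you will have a directory named vosk-model-en-us-0.22 containing the model files. On machines with limited memory or disk space, the much smaller vosk-model-small-en-us-0.15 (about 40 MB) can be used the same way; adjust the file and directory names in these commands and in the script accordingly.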
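If you prefer to script the download instead of using wget and unzip, the same two steps can be sketched in Python with urllib.request.urlretrieve and zipfile. This is a minimal sketch, assuming the archive contains a single top-level folder named after the model (as the English model above does); the helpers download and extract_model are hypothetical names.

```python
import os
import urllib.request
import zipfile


def download(url: str, dest_path: str) -> str:
    """Download `url` to `dest_path`, skipping the download if the file already exists."""
    if not os.path.exists(dest_path):
        urllib.request.urlretrieve(url, dest_path)
    return dest_path


def extract_model(zip_path: str, dest_dir: str) -> str:
    """Unzip a model archive into `dest_dir` and return the extracted model directory."""
    with zipfile.ZipFile(zip_path) as zf:
        # Collect the first path component of every entry in the archive
        top_level = {name.split("/")[0] for name in zf.namelist() if name}
        if len(top_level) != 1:
            raise ValueError(f"expected one top-level folder, found {sorted(top_level)}")
        zf.extractall(dest_dir)
    return os.path.join(dest_dir, top_level.pop())
```

For example, extract_model(download(url, os.path.expanduser("~/vosk-models/vosk-model-en-us-0.22.zip")), os.path.expanduser("~/vosk-models")) mirrors the shell commands above. Because download skips existing files, delete a partially downloaded archive before retrying.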
5. Create a Python Script for Speech Recognition
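Create a new Python script named asr.py in ~/vosk-models (the directory that contains the model folder, because the script loads the model by a relative path) to perform speech recognition with Vosk: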
nano asr.py
Then, paste the following code into the file:
import json
import os
import sys

import pyaudio
from vosk import Model, KaldiRecognizer

# Path to the Vosk model directory (relative to where the script is run)
model_path = "vosk-model-en-us-0.22"

# Load the Vosk model
if not os.path.exists(model_path):
    print(f"Model not found: {model_path}")
    sys.exit(1)

model = Model(model_path)
# The sample rate given here must match the rate of the audio stream below
recognizer = KaldiRecognizer(model, 16000)

# Set up a 16 kHz, mono, 16-bit microphone stream
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16,
                channels=1,
                rate=16000,
                input=True,
                frames_per_buffer=8000)
stream.start_stream()

print("Listening...")

# Recognize speech
while True:
    data = stream.read(4000, exception_on_overflow=False)
    if recognizer.AcceptWaveform(data):
        # A complete utterance was recognized: print its text
        result = recognizer.Result()
        print(json.loads(result)["text"])
    else:
        # Still listening: print the partial hypothesis (raw JSON)
        print(recognizer.PartialResult())
This script does the following:
- Loads the Vosk model for English.
- Sets up an audio input stream using PyAudio.
- Reads microphone audio in chunks of 4,000 frames and feeds them to the recognizer in real time.
- Prints each completed utterance as plain text and, in between, prints the recognizer's partial result as raw JSON.
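Because partial results are printed as raw JSON on every chunk, the console quickly fills with lines such as {"partial" : ""}. Vosk returns JSON with a "text" key from Result() and FinalResult() and a "partial" key from PartialResult(). The sketch below, assuming those keys, prints only non-empty partials that changed since the last chunk; the names TranscriptPrinter and process_chunk are hypothetical, and the recognizer argument can be any object with KaldiRecognizer's AcceptWaveform, Result, and PartialResult methods.

```python
import json
from typing import Optional


def final_text(result_json: str) -> str:
    """Text of a completed utterance, from Result() or FinalResult()."""
    return json.loads(result_json).get("text", "")


def partial_text(partial_json: str) -> str:
    """In-progress hypothesis, from PartialResult()."""
    return json.loads(partial_json).get("partial", "")


class TranscriptPrinter:
    """Turns recognizer JSON into console lines, skipping empty and repeated partials."""

    def __init__(self) -> None:
        self.last_partial = ""

    def on_final(self, result_json: str) -> Optional[str]:
        self.last_partial = ""  # a new utterance starts after a final result
        text = final_text(result_json)
        return text or None

    def on_partial(self, partial_json: str) -> Optional[str]:
        text = partial_text(partial_json)
        if text == self.last_partial:
            return None  # nothing new to show
        self.last_partial = text
        return f"... {text}" if text else None


def process_chunk(recognizer, data: bytes, printer: TranscriptPrinter) -> Optional[str]:
    """Feed one audio chunk to a KaldiRecognizer-like object; return a line to print, or None."""
    if recognizer.AcceptWaveform(data):
        return printer.on_final(recognizer.Result())
    return printer.on_partial(recognizer.PartialResult())
```

In asr.py, you could create printer = TranscriptPrinter() before the loop and replace the if/else inside it with line = process_chunk(recognizer, data, printer), printing line only when it is not None.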
6. Install PyAudio
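The script imports PyAudio, so it must be installed before you run the script. Building PyAudio with pip requires the PortAudio development headers, so install those first:
sudo apt install portaudio19-dev
pip3 install pyaudio
Alternatively, Ubuntu packages PyAudio as python3-pyaudio (sudo apt install python3-pyaudio); this installs it for the system Python, so it is not visible inside a virtual environment.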
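Once PyAudio is installed, you can check which recording devices it sees. PyAudio's get_device_count() and get_device_info_by_index() return per-device dictionaries that include "name", "maxInputChannels", and "defaultSampleRate". The minimal sketch below filters them down to input devices; the helper input_devices is a hypothetical name.

```python
from typing import List, Tuple


def input_devices(pa) -> List[Tuple[int, str, int]]:
    """List (index, name, default sample rate) for every device with input channels."""
    devices = []
    for index in range(pa.get_device_count()):
        info = pa.get_device_info_by_index(index)
        # Output-only devices (speakers, HDMI) report zero input channels
        if info.get("maxInputChannels", 0) > 0:
            devices.append((index, info["name"], int(info["defaultSampleRate"])))
    return devices
```

Call it with pa = pyaudio.PyAudio(), print the result, and then call pa.terminate(). To record from a specific device, pass its index as input_device_index= to p.open() in asr.py. Some hardware devices may reject a 16000 Hz stream; the system's default device usually resamples, but this depends on your audio setup.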
7. Run the Speech Recognition Script
Once everything is set up, run the script from ~/vosk-models so that the relative model path resolves:
python3 asr.py
Speak into your microphone. Completed phrases are printed to the terminal in real time, with partial results printed in between. Press Ctrl+C to stop the script.
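Because the recognition loop in asr.py never exits on its own, Ctrl+C stops it with a KeyboardInterrupt traceback, without closing the stream or printing the last, partly spoken phrase. The sketch below wraps the loop so that Ctrl+C flushes the recognizer with FinalResult() and then stops and closes the stream and terminates PyAudio. It assumes the stream, recognizer, and PyAudio objects created in asr.py; the function name run and its on_text callback are hypothetical.

```python
import json


def run(stream, recognizer, pa, on_text=print, chunk_frames=4000) -> None:
    """Recognize speech until Ctrl+C, flush the last utterance, then release the audio device."""
    try:
        while True:
            data = stream.read(chunk_frames, exception_on_overflow=False)
            if recognizer.AcceptWaveform(data):
                text = json.loads(recognizer.Result()).get("text", "")
                if text:
                    on_text(text)
    except KeyboardInterrupt:
        # FinalResult() returns whatever was still buffered when the user stopped
        text = json.loads(recognizer.FinalResult()).get("text", "")
        if text:
            on_text(text)
    finally:
        # Always release the microphone, even if an unexpected error occurred
        stream.stop_stream()
        stream.close()
        pa.terminate()
```

In asr.py, the while True: loop could then be replaced by a single call, run(stream, recognizer, p). This version prints only completed phrases; partial results are left out to keep the output readable.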
8. Troubleshooting
If you encounter issues with microphone access, ensure that your microphone is properly configured and recognized by the system. You can check your audio devices using:
arecord -l
This command lists the capture (recording) devices that ALSA detects. If your microphone is listed but the script recognizes nothing, make sure it is selected as the default input device in your audio settings and that it is not muted.
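To tell microphone problems apart from recognition problems, you can record a short test clip in the format the recognizer expects, for example with arecord -f S16_LE -r 16000 -c 1 -d 5 test.wav (16-bit little-endian, 16000 Hz, mono, 5 seconds), and then transcribe the file. The sketch below follows the chunked AcceptWaveform/Result/FinalResult pattern used in asr.py but reads from a WAV file with the standard wave module; the function name transcribe_wav is hypothetical, and the recognizer must be created with the file's sample rate.

```python
import json
import wave


def transcribe_wav(path: str, recognizer, chunk_frames: int = 4000) -> str:
    """Feed a WAV file to a KaldiRecognizer-like object and return the full transcript."""
    pieces = []
    with wave.open(path, "rb") as wf:
        while True:
            data = wf.readframes(chunk_frames)
            if not data:
                break  # end of file
            if recognizer.AcceptWaveform(data):
                pieces.append(json.loads(recognizer.Result()).get("text", ""))
    # Flush the final, possibly incomplete, utterance
    pieces.append(json.loads(recognizer.FinalResult()).get("text", ""))
    return " ".join(p for p in pieces if p)
```

For example, with rec = KaldiRecognizer(Model("vosk-model-en-us-0.22"), 16000), calling print(transcribe_wav("test.wav", rec)) should print what you said. If the file transcribes correctly but live recognition does not, the problem is most likely in the microphone or stream setup rather than in Vosk.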
9. Conclusion
You have now set up an Automatic Speech Recognition system on Ubuntu using Vosk. It recognizes speech from your microphone in real time and can be extended for applications such as voice commands and transcription services.
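As one illustration of the voice-command use case, the recognized text from Result() can be matched against a table of trigger phrases. This is a minimal sketch, not a feature of Vosk: the function dispatch and the example phrases are hypothetical, and matching is done on whole words so that, for instance, "stop" does not fire on "nonstop".

```python
from typing import Callable, Dict, Optional


def dispatch(text: str, commands: Dict[str, Callable[[], str]]) -> Optional[str]:
    """Run the first command whose trigger phrase appears as whole words in `text`."""
    # Normalize case and whitespace, then pad so phrases only match on word boundaries
    padded = f" {' '.join(text.lower().split())} "
    for phrase, action in commands.items():
        if f" {phrase.lower()} " in padded:
            return action()
    return None
```

For example, with commands = {"lights on": turn_on_lights, "stop": stop_listening} (hypothetical functions), you would call dispatch(json.loads(recognizer.Result())["text"], commands) after each completed utterance. For short command sets, Vosk's KaldiRecognizer also accepts an optional list of allowed phrases as a JSON string, which can improve accuracy; this is supported only by some models, so check the model's documentation first.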