Setting Up Voice Activity Detection on Linux
This guide provides detailed instructions for setting up Voice Activity Detection (VAD) on Ubuntu alongside the LLaMA CPP framework, with the detection itself performed by the WebRTC VAD library. VAD determines whether human speech is present in an audio signal.
Install System Prerequisites
First, ensure your Ubuntu system is up to date and has Python, pip, and Git installed. Open a terminal and run the following commands:
sudo apt update
sudo apt upgrade
sudo apt install python3 python3-pip git
This will install the necessary system dependencies.
Install LLaMA CPP
Clone the LLaMA CPP repository from GitHub and navigate to the directory:
git clone https://github.com/your-repo/llama.cpp.git
cd llama.cpp
Replace the URL with the actual repository URL if you have a specific version or fork in mind.
Build the LLaMA CPP Library
After cloning the repository, build the LLaMA CPP library:
make
This command compiles the source code and prepares the library for use. Note that recent versions of llama.cpp have moved from Make to CMake, so if make fails, follow the build instructions in the repository's README.
Install Required Python Libraries
To work with audio files and perform VAD, you will need additional Python libraries. Install them with pip:
python3 -m pip install numpy soundfile webrtcvad
The webrtcvad library provides an efficient implementation of the WebRTC voice activity detector; soundfile is useful for converting audio into the 16-bit PCM WAV format it expects.
Create a Python Script for Voice Activity Detection
Create a new Python script named vad.py to implement voice activity detection:
nano vad.py
Paste the following code into the file:
import wave

import numpy as np
import webrtcvad

# Run WebRTC VAD over a 16-bit mono PCM WAV file, frame by frame
def vad(audio_file):
    # Open the audio file and read the raw sample data
    with wave.open(audio_file, 'rb') as wf:
        sample_rate = wf.getframerate()
        num_channels = wf.getnchannels()
        audio = wf.readframes(wf.getnframes())

    # webrtcvad only accepts mono audio at a handful of sample rates
    if num_channels != 1:
        raise ValueError("webrtcvad requires mono audio")
    if sample_rate not in (8000, 16000, 32000, 48000):
        raise ValueError(f"unsupported sample rate: {sample_rate} Hz")

    # Interpret the raw bytes as 16-bit PCM samples
    audio = np.frombuffer(audio, dtype=np.int16)

    # Initialize the detector (mode 0 = least aggressive, 3 = most)
    detector = webrtcvad.Vad(1)

    # Split the audio into frames; webrtcvad accepts 10, 20, or 30 ms
    frame_duration = 30  # milliseconds
    frame_size = int(sample_rate * frame_duration / 1000)
    # Drop any incomplete trailing frame, which webrtcvad would reject
    frames = [audio[i:i + frame_size]
              for i in range(0, len(audio) - frame_size + 1, frame_size)]

    # Detect voice activity in each frame
    for i, frame in enumerate(frames):
        if detector.is_speech(frame.tobytes(), sample_rate):
            print(f"Frame {i}: Speech detected")
        else:
            print(f"Frame {i}: No speech")

# Example usage
audio_file = 'path/to/your/audio.wav'  # Replace with your audio file path
vad(audio_file)
This script performs the following steps:
- Loads the audio file and interprets the raw samples as 16-bit PCM.
- Initializes the VAD instance with a chosen aggressiveness level (0 to 3).
- Splits the audio into 30 ms frames and reports speech activity for each frame.
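Beyond printing per-frame results, the frame decisions are often merged into continuous speech segments. The helper below is a sketch of that post-processing step; merge_segments is an illustrative name, flags stands in for the list of is_speech results, and times are reported in milliseconds:

```python
# Sketch: merge consecutive speech frames into (start, end) segments.
# `flags` stands in for the per-frame is_speech results; times are in ms.
def merge_segments(flags, frame_ms=30):
    segments = []
    start = None
    for i, speech in enumerate(flags):
        if speech and start is None:
            start = i * frame_ms                      # a segment opens
        elif not speech and start is not None:
            segments.append((start, i * frame_ms))    # the segment closes
            start = None
    if start is not None:                             # speech ran to the end
        segments.append((start, len(flags) * frame_ms))
    return segments

print(merge_segments([False, True, True, False, True]))
# [(30, 90), (120, 150)]
```

Segment boundaries like these are what downstream tools (transcription, command detection) usually consume, rather than raw frame labels.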
Run the Voice Activity Detection Script
Once everything is set up, you can run the voice activity detection script. Make sure to replace path/to/your/audio.wav with the actual path to your audio file:
python3 vad.py
The script will print whether speech is detected in each frame of the audio file.
Troubleshooting
If you encounter issues, ensure that:
- All libraries are correctly installed.
- The audio file is a WAV file containing 16-bit PCM samples.
- The audio is mono and uses a sample rate webrtcvad supports (8000, 16000, 32000, or 48000 Hz).
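These format checks can be automated. The diagnostic below is a sketch (check_wav is an illustrative helper name, not part of any library) that reports which properties of a WAV file would trip up webrtcvad:

```python
import wave

# Report any properties of a WAV file that webrtcvad cannot handle.
# check_wav is an illustrative helper name, not part of any library.
def check_wav(path):
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        width = wf.getsampwidth()    # bytes per sample; 2 means 16-bit PCM
    problems = []
    if rate not in (8000, 16000, 32000, 48000):
        problems.append(f"unsupported sample rate {rate} Hz")
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if width != 2:
        problems.append(f"expected 16-bit samples, got {width * 8}-bit")
    return problems or ["OK"]

# Quick self-demo: write one second of 16 kHz mono silence and check it.
with wave.open("demo.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)
print(check_wav("demo.wav"))  # ['OK']
```

Running this on a problem file tells you whether to remix to mono, resample, or re-export as 16-bit PCM before retrying the VAD script.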
Conclusion
You have successfully set up a Voice Activity Detection system on Ubuntu using LLaMA CPP and the WebRTC VAD. This setup allows you to detect human speech in audio signals and can be extended for various applications such as transcription, command detection, or filtering out background noise.