Setting Up Voice Activity Detection on Linux

This guide provides detailed instructions for setting up Voice Activity Detection (VAD) on Ubuntu. The detection itself is handled by the WebRTC VAD library, with the LLaMA CPP framework built alongside it for downstream speech processing. VAD detects the presence or absence of human speech in an audio signal.

Install System Prerequisites

First, ensure your Ubuntu system is updated and has Python and Pip installed. Open a terminal and run the following commands:

sudo apt update
sudo apt upgrade
sudo apt install python3 python3-pip git

This will install the necessary system dependencies.
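
You can quickly confirm the tools are available before continuing (a simple sanity check; the exact version numbers printed will vary):

```shell
# Print versions to confirm each tool is on the PATH
python3 --version
git --version
```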

Install LLaMA CPP

Clone the LLaMA CPP repository from GitHub and navigate to the directory:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

The URL above points to the upstream repository; replace it with the URL of your own fork if you have a specific version in mind.

Build the LLaMA CPP Library

After cloning the repository, build the LLaMA CPP library:

make

Note that recent versions of llama.cpp have deprecated the Makefile in favor of CMake; if make fails, build with cmake -B build followed by cmake --build build --config Release. Either route compiles the source code and prepares the library for use.

Install Required Python Libraries

To perform VAD you only need the webrtcvad package (the wave module used below ships with Python); numpy and soundfile are useful extras for converting audio into the required format. Install them using Pip:

pip install numpy soundfile webrtcvad

The webrtcvad package wraps the VAD engine from Google's WebRTC project, which expects 16-bit mono PCM audio at 8000, 16000, 32000, or 48000 Hz.

Create a Python Script for Voice Activity Detection

Create a new Python script named vad.py to implement voice activity detection:

nano vad.py

Paste the following code into the file:

import wave
import webrtcvad

# Read a 16-bit mono PCM WAV file and run VAD on it
def detect_voice_activity(audio_file):
    # Open the audio file and read the raw samples
    with wave.open(audio_file, 'rb') as wf:
        sample_rate = wf.getframerate()
        num_channels = wf.getnchannels()
        sample_width = wf.getsampwidth()
        audio = wf.readframes(wf.getnframes())

    # webrtcvad only accepts 16-bit mono PCM at certain sample rates
    if num_channels != 1 or sample_width != 2:
        raise ValueError("Audio must be 16-bit mono PCM")
    if sample_rate not in (8000, 16000, 32000, 48000):
        raise ValueError("Sample rate must be 8000, 16000, 32000, or 48000 Hz")

    # Initialize VAD (modes 0-3; higher is more aggressive)
    vad = webrtcvad.Vad(1)  # Mode 1: less aggressive

    # Split audio into frames (webrtcvad accepts 10, 20, or 30 ms frames);
    # a trailing partial frame is dropped because is_speech would reject it
    frame_duration = 30  # milliseconds
    frame_bytes = int(sample_rate * frame_duration / 1000) * sample_width
    frames = [audio[i:i + frame_bytes]
              for i in range(0, len(audio) - frame_bytes + 1, frame_bytes)]

    # Detect voice activity frame by frame
    for i, frame in enumerate(frames):
        if vad.is_speech(frame, sample_rate):
            print(f"Frame {i}: Speech detected")
        else:
            print(f"Frame {i}: No speech")

# Example usage
audio_file = 'path/to/your/audio.wav'  # Replace with your audio file path
detect_voice_activity(audio_file)

This script performs the following steps:

  • Opens the WAV file and reads its raw 16-bit PCM samples.
  • Initializes the VAD instance with a specified aggressiveness level.
  • Splits the audio into 30 ms frames and checks each one for speech activity.
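
The frame-splitting arithmetic deserves a closer look: a 30 ms frame contains sample_rate × 0.03 samples, and each 16-bit sample takes two bytes. A small sketch of that calculation (the frame_size_bytes helper is illustrative, not part of webrtcvad):

```python
# Bytes per VAD frame for 16-bit mono PCM audio
def frame_size_bytes(sample_rate, frame_duration_ms=30, bytes_per_sample=2):
    samples_per_frame = sample_rate * frame_duration_ms // 1000
    return samples_per_frame * bytes_per_sample

# webrtcvad's four supported sample rates
for rate in (8000, 16000, 32000, 48000):
    print(f"{rate} Hz -> {frame_size_bytes(rate)} bytes per 30 ms frame")
```

At 16 kHz, a 30 ms frame is 480 samples, or 960 bytes; passing a frame of any other length to is_speech raises an error.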

Run the Voice Activity Detection Script

Once everything is set up, you can run the voice activity detection script. Replace path/to/your/audio.wav with the actual path to your audio file; the file must be a WAV containing 16-bit mono PCM samples at 8000, 16000, 32000, or 48000 Hz, since webrtcvad rejects anything else:

python3 vad.py

The script will print whether speech is detected in each frame of the audio file.
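
If you don't have a recording in the right format handy, you can generate a compatible test file with the standard library alone. This is a sketch: the one-second 440 Hz tone and the test.wav filename are arbitrary choices, and a pure tone may or may not register as speech.

```python
import math
import struct
import wave

def write_test_wav(path, sample_rate=16000, duration_s=1.0, freq_hz=440.0):
    """Write a 16-bit mono PCM WAV file containing a sine tone."""
    n_samples = int(sample_rate * duration_s)
    with wave.open(path, 'wb') as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 16-bit samples
        wf.setframerate(sample_rate)
        for n in range(n_samples):
            # Half-amplitude sine wave, packed as little-endian signed 16-bit
            value = int(16383 * math.sin(2 * math.pi * freq_hz * n / sample_rate))
            wf.writeframesraw(struct.pack('<h', value))

write_test_wav('test.wav')
```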

Troubleshooting

If you encounter issues, ensure that:

  • All libraries are correctly installed (pip show webrtcvad should list the package).
  • The audio file is a WAV file containing 16-bit mono PCM samples.
  • The sample rate is 8000, 16000, 32000, or 48000 Hz; webrtcvad raises an error for other rates.
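
To diagnose format problems, you can inspect the WAV header directly. The check_wav helper below is a hypothetical utility (not part of webrtcvad) that returns a list of anything that would make the file unusable:

```python
import io
import wave

def check_wav(path_or_file):
    """Return a list of problems that make a WAV file unusable with webrtcvad."""
    with wave.open(path_or_file, 'rb') as wf:
        channels, width, rate = wf.getnchannels(), wf.getsampwidth(), wf.getframerate()
    problems = []
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if width != 2:
        problems.append(f"expected 16-bit samples, got {width * 8}-bit")
    if rate not in (8000, 16000, 32000, 48000):
        problems.append(f"unsupported sample rate {rate} Hz")
    return problems  # an empty list means the file is compatible

# Demonstrate on an in-memory stereo WAV at an unsupported rate
buf = io.BytesIO()
with wave.open(buf, 'wb') as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)
    wf.setframerate(44100)
    wf.writeframes(b'\x00\x00\x00\x00' * 4)
buf.seek(0)
print(check_wav(buf))  # reports the stereo channels and the 44100 Hz rate
```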

Conclusion

You have successfully set up a Voice Activity Detection system on Ubuntu using the WebRTC VAD, with the LLaMA CPP library built and ready for downstream speech tasks. This setup lets you detect human speech in audio signals and can be extended to applications such as transcription, command detection, or filtering out background noise.