How to run vLLM on Windows using WSL

In this article, you’ll learn how to set up vLLM on Windows using WSL. We’ll cover everything from installing the necessary Nvidia GPU drivers to running a language model and sending a prompt through Postman. By following along with the included tutorial videos, you’ll be guided through each step, including installing WSL Ubuntu, setting up Python PIP and Miniconda, and finally running your preferred model from Hugging Face via vLLM. Whether you’re looking to optimize your workflow or get hands-on experience with language models, this article will walk you through the entire process.

What does WSL stand for?

Windows Subsystem for Linux (WSL) is a compatibility layer that allows you to run Linux distributions directly on your Windows system, without setting up a traditional virtual machine or a dual-boot configuration. It gives developers access to Linux-based tools, scripts, and applications while maintaining the familiarity of a Windows environment.
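If you're unsure whether WSL is already enabled on your machine, you can check before starting (an optional step we suggest; it isn't part of the video tutorials). Open PowerShell and run the command below; it reports the default distribution and WSL version if WSL is installed.

wsl --status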

Running an LLM on Windows using WSL (cheat sheet)

  1. Download Nvidia GPU driver
  2. Install driver via Express Installation
  3. Install WSL Ubuntu from Microsoft Store
  4. Install CUDA toolkit on WSL
  5. Install Python PIP on WSL
  6. Download and install Miniconda
  7. Create and activate Conda environment
  8. Install vLLM
  9. Search for model on Hugging Face
  10. Run model via terminal
  11. Create POST request in Postman
  12. Send prompt and receive model response

Step 1: Download and install Nvidia GPU driver

In this video, you'll learn how to install or update your Nvidia GPU drivers step-by-step. The tutorial begins by navigating to Nvidia's official download page, where you'll select your GPU's series, product, and operating system. After downloading the appropriate driver, you'll run the installer, choose a folder to save the installation files, and opt for the Express installation option. The video walks you through the installation process, ensuring you're equipped with the latest drivers. Once the installation is complete, you'll simply close the installer.

First, head to the Nvidia downloads page. Select the right driver to download by choosing from the dropdown lists. In our case, that's the drivers for the GeForce RTX 3090 GPU on Windows 10, as you can see in Figure 1. Once ready, click Search.

Search for driver for video card
Figure 1 - Search for driver for video card

Double-check that you have selected the correct driver. If so, click Download, as highlighted in Figure 2.

Download driver
Figure 2 - Download driver

Start the download by pressing the Download button, as demonstrated by Figure 3.

Start download
Figure 3 - Start download

Wait for the download to finish, and open the installer (Figure 4).

Open installer
Figure 4 - Open installer

Select Express Installation and click Next, as showcased in Figure 5.

Install options
Figure 5 - Install options

Wait for the installation process to finish, as shown in Figure 6.

Install process
Figure 6 - Install process

If everything went smoothly, you are met with a window similar to Figure 7. Click Close.

Install finished
Figure 7 - Driver install finished

Step 2: Install WSL Ubuntu

In the next video, you'll be taught to install WSL Ubuntu on your Windows system using the Microsoft Store. The tutorial walks you through searching for Ubuntu LTS in the store, selecting your preferred version, and clicking Get to begin the installation. After waiting for the installation to finish, the video shows how to open the newly installed Ubuntu application, enabling you to start using Linux on your Windows machine through WSL.

Open Microsoft Store, and search for Ubuntu LTS. For today's guide, we'll be using Ubuntu 24.04.1 LTS, as illustrated in Figure 8. Once on the store page of your desired version, click Get and wait for the installation to finish.

Download WSL from Microsoft Store
Figure 8 - Download WSL from Microsoft Store

Open it. There are multiple ways to do this; this time, we'll hit the Windows key and type ubuntu into the search field (Figure 9).

Open WSL Ubuntu
Figure 9 - Open WSL Ubuntu

If you see a command line interface like Figure 10, you have been successful so far.

WSL Ubuntu running
Figure 10 - WSL Ubuntu running
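Before moving on, it's worth confirming that the Nvidia driver you installed in Step 1 is visible inside WSL (an optional sanity check we suggest; it isn't shown in the videos). Run the command below in the Ubuntu terminal; it should list your GPU along with the driver version.

nvidia-smi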

Step 3: Install CUDA toolkit on WSL

In this video, you'll learn how to install Nvidia's CUDA toolkit on your WSL interface. The tutorial starts by navigating to the CUDA toolkit downloads page, where you'll select the appropriate Operating System, Architecture, Distribution, Version, and Installer type. After that, you'll copy and paste each installation command from the page into your WSL interface, one by one, and wait for the individual installations to finish.

Head to the CUDA Toolkit downloads page, and select the operating system, architecture, distribution, version, and installer type. In our case, those are, in order: Linux, x86_64, WSL-Ubuntu, 2.0, and deb (local).

You should get a list of commands similar to the one in Figure 11, or the ones below.

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.6.2/local_installers/cuda-repo-wsl-ubuntu-12-6-local_12.6.2-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-6-local_12.6.2-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-6-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-6

Run CUDA Toolkit install commands
Figure 11 - Run CUDA Toolkit install commands

Run each command individually, and wait for the installations to finish, as seen in Figure 12.

CUDA toolkit installation finished
Figure 12 - CUDA toolkit installation finished
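As an optional check (our addition, not part of the video), you can confirm which toolkit version was installed. Note that the installer places the CUDA binaries under /usr/local/cuda/bin, which may not be on your PATH by default:

/usr/local/cuda/bin/nvcc --version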

Step 4: Install Python PIP and Conda environment

In this video tutorial, you are guided through the process of installing Python PIP and Miniconda on Ubuntu WSL. It begins with a demonstration of how to install Python PIP using the command line. Once the command is executed, the viewer is prompted to confirm the installation by entering 'yes' and waiting for the process to complete.

Next, the tutorial shows how to navigate to the Miniconda downloads page and copy the link address for the Linux 64-bit installer. The video continues with downloading Miniconda via the "wget" command, followed by the installation process. Viewers are instructed to follow the on-screen prompts during the installation, waiting until it is fully completed before concluding.

Enter the command below into the Terminal to install Python PIP (Figure 13).

sudo apt-get install python3-pip

Install Python PIP
Figure 13 - Install Python PIP
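To verify the installation, you can print PIP's version (an optional check; it isn't part of the video):

pip3 --version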

On the Miniconda downloads page, look for the Linux 64-bit installer, right-click on it, and select Copy link address from the dropdown list. Open the terminal, enter wget, and paste the link from your clipboard, as highlighted in Figure 14.

Alternatively, you can copy and paste the command below to do the same.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Download conda install script
Figure 14 - Download conda install script

Run the install script you've just downloaded by pasting the command below into the Terminal, as illustrated by Figure 15. Run it as your regular user, without sudo, so Miniconda is installed under your home directory rather than root's.

bash Miniconda3-latest-Linux-x86_64.sh

Run conda install script
Figure 15 - Run conda install script

Follow the instructions of the installer, and wait for the process to finish. When done, your Terminal should look similar to Figure 16.

Conda install finished
Figure 16 - Conda install finished
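If the conda command isn't recognized after the installer finishes, the shell hooks likely haven't been loaded yet. Assuming the default install location of ~/miniconda3, you can activate the base environment manually and let Conda register itself in your shell profile, then close and reopen the terminal:

source ~/miniconda3/bin/activate
conda init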

Step 5: Install and run vLLM

In this video tutorial, the viewer learns how to run their preferred LLM and send a prompt to it via an HTTP POST request using Postman. The video begins with creating and activating a new Conda environment, followed by the installation of vLLM through a command. Once the environment is set, the viewer is directed to Hugging Face to search for their desired model, in this case, Llama-3.2-3B-Instruct-uncensored.

The tutorial demonstrates how to navigate the model's page, click Use this model, and select vLLM from the dropdown. It then shows how to copy the command under # Load and run the model and execute it in the terminal. After waiting for the model to load, the focus shifts to Postman, where a new POST request is created.

The viewer is shown how to copy the address and request body, including the prompt, from the model's page into Postman. They can then modify the prompt to fit their needs and send the request. Finally, the tutorial checks whether the model's response is accurate and coherent.

Create a new Conda environment by pasting the command below into the Terminal, as highlighted in Figure 17.

conda create -n myenv python=3.12 -y

Create conda environment
Figure 17 - Create conda environment

Then, activate this new environment by pasting the next piece of code into the Terminal, as you can see in Figure 18.

conda activate myenv

Activate conda environment
Figure 18 - Activate conda environment

After that, install vLLM using this command (Figure 19):

pip install vllm

Install vLLM
Figure 19 - Install vLLM
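You can optionally confirm that vLLM was installed into the active environment by printing its version (our suggestion; this step isn't in the video):

python -c "import vllm; print(vllm.__version__)"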

Navigate to the page of your preferred model on Hugging Face, and click the Use this model button near the right edge of the screen. Select vLLM from the dropdown list, highlighted by a red arrow in Figure 20.

Use model in vLLM
Figure 20 - Use model in vLLM

From the pop-up window, copy the command under the # Load and run the model comment, signified by a bright red frame in Figure 21.

Alternatively, if you're using the exact same model as we do in this tutorial, you may copy the following command:

vllm serve "chuanli11/Llama-3.2-3B-Instruct-uncensored"

Copy run command
Figure 21 - Copy run command
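If the model doesn't fit into your GPU's memory, vLLM's serve command accepts tuning flags. The example below is a sketch with assumed values: --max-model-len caps the context window, and --gpu-memory-utilization limits the fraction of VRAM vLLM may claim; adjust both to your card.

vllm serve "chuanli11/Llama-3.2-3B-Instruct-uncensored" --max-model-len 8192 --gpu-memory-utilization 0.90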

Paste the content of your clipboard to the Terminal window, and execute the command. Wait for the model to load, as seen in Figure 22.

Run vLLM
Figure 22 - Run vLLM
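By default, vLLM exposes an OpenAI-compatible API on port 8000. Before switching to Postman, you can check from a second terminal that the server is up by listing the available models:

curl http://localhost:8000/v1/models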

Open Postman, and create a new POST request. Copy the address and the request body from Hugging Face, each to their respective fields, just like in Figure 23.

Adjust the test prompt to your needs. For the sake of this tutorial, we'll be asking where Budapest is.

Send request to model
Figure 23 - Send request to model

Send the request, and check the response in Postman. If you get a reply, and it's the correct answer to your question, you have completed our tutorial (Figure 24).

Answer received from model
Figure 24 - Answer received from model
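If you prefer the command line to Postman, an equivalent request can be sent with curl. The sketch below targets vLLM's OpenAI-compatible completions endpoint on the default port; the prompt and sampling parameters are example values you can change freely.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chuanli11/Llama-3.2-3B-Instruct-uncensored",
    "prompt": "Where is Budapest?",
    "max_tokens": 100,
    "temperature": 0.7
  }'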

Closing thoughts

By the end of this article, you’ve gained the knowledge needed to run vLLM on Windows through WSL. You’ve learned how to install Nvidia GPU drivers, set up WSL Ubuntu, and install both Python PIP and the Miniconda environment. On top of that, you've seen how to load a model from Hugging Face, send a prompt to it using Postman, and verify the model’s response. With these steps under your belt, you’re prepared to effectively run language models using this versatile setup.

Are there easier ways to run an LLM locally on Windows?

Yes, read this article to find out how you can run LLMs on Windows, using Ozeki AI Studio.

More information