How to run an AI model using vLLM

In this tutorial, you'll learn how to set up and run an AI model locally on a Linux system using vLLM. By the end, you'll have a working environment where you can download and interact with models directly from Hugging Face, making it easier to experiment with AI without relying on external cloud services. This guide covers everything from installing essential tools like Python PIP and Miniconda to setting up vLLM and running a model through simple commands in the Terminal. No matter your experience level, this step-by-step process ensures you can get started with local AI model execution.

What is vLLM?

vLLM is an open-source, high-performance runtime for serving large language models (LLMs). It allows you to efficiently load, run, and interact with AI models on your local machine or server. Built to optimize the speed and scalability of model inference, vLLM makes it easier to use large models like those from Hugging Face without requiring extensive resources or cloud-based solutions.

Running an AI model using vLLM (in a nutshell)

  1. Install Python PIP
  2. Download and install Miniconda
  3. Create and activate a new Conda environment
  4. Install vLLM
  5. Select your desired model from Hugging Face
  6. Load the model with a Terminal command
  7. Test the model by sending a test prompt

Install Python PIP and Conda

In this video, you’ll learn how to install Python PIP and Miniconda on a Linux system using the Terminal. The tutorial walks you through each step in detail, starting with the installation of Python PIP directly from the Terminal with a simple command. From there, you'll open a browser and go to the Anaconda website to download the Miniconda installer. Once downloaded, the video shows you how to navigate to the download folder via the Terminal and execute the installation of Miniconda.

First, you'll need to install Python PIP. To do that, enter the following command into the Terminal, as seen in Figure 1:

sudo apt-get install python3-pip

Install PIP
Figure 1 - Install PIP
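
To confirm that PIP was installed correctly, you can check its version:

pip3 --version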

Navigate to the Miniconda website, and download Miniconda3 Linux 64-bit.

Download miniconda
Figure 2 - Download miniconda
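
If you prefer to stay in the Terminal, you can also fetch the installer directly with wget. The URL below is the standard Miniconda download link at the time of writing; check the Miniconda website if it has changed:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh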

To install Miniconda, navigate to the folder you saved the installation package in, and enter the command below into the Terminal, as shown in Figure 3:

sudo bash Miniconda3-latest-Linux-x86_64.sh

Install conda
Figure 3 - Install conda

When this message (Figure 4) pops up, type yes to accept the license terms, and proceed with the installation.

Accept license
Figure 4 - Accept license
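
Once the installer finishes, the conda command may not be available in your current session yet. Assuming Miniconda was installed to the default location, ~/miniconda3 (adjust the path if you chose a different one), you can initialize it and reload your shell configuration:

~/miniconda3/bin/conda init bash
source ~/.bashrc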

Install vLLM

This video provides a step-by-step guide to installing vLLM on a Linux system using Miniconda. It begins by demonstrating how to create a new Conda environment through the terminal. After the environment is created, the video shows how to activate it with a specific command. Next, it walks through the installation of vLLM using another command, followed by a brief wait for the installation process to complete.

To create a new Conda environment, enter the following command, just like in Figure 5:

conda create -n myenv python=3.12 -y

Create a new conda environment
Figure 5 - Create a new conda environment
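
If you'd like to confirm that the environment was created, you can list all Conda environments:

conda env list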

Next, activate the environment by entering the command below (Figure 6):

conda activate myenv

Activate conda environment
Figure 6 - Activate conda environment

After that, install vLLM with the following command, highlighted in red in Figure 7:

pip install vllm

Install vLLM
Figure 7 - Install vLLM
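
To verify that vLLM installed successfully, you can print its version from within the environment (this assumes the package exposes a __version__ attribute, which current releases do):

python -c "import vllm; print(vllm.__version__)"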

Start vLLM service

The last video in this article guides you through the process of downloading and running an AI model from Hugging Face using vLLM. It starts by opening a browser and navigating to huggingface.co. From there, the video demonstrates how to search for models. It specifically selects the Llama-3.2-3B-Instruct-uncensored model. After clicking Use this model, vLLM is chosen from the dropdown menu, and a command for loading and running the model is copied. The command is then pasted into the terminal, and the video shows the loading process. Once completed, a new terminal window is opened to send an HTTP POST request with a test prompt, and the final step verifies that the response is correct and coherent.

Navigate to Hugging Face, and search for your desired model. For this tutorial, we'll be using Llama-3.2-3B-Instruct-uncensored. Once you've found it, click it, as seen in Figure 8.

Select model from Huggingface.co
Figure 8 - Select model from Huggingface.co

On the model's page, click the Use this model button located near the right edge of the screen. Select vLLM from the dropdown menu, just like in Figure 9.

Use this model in vLLM
Figure 9 - Use this model in vLLM

Copy the command under # Load and run the model:, as demonstrated by Figure 10.

Copy vLLM command
Figure 10 - Copy vLLM command
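
For reference, the copied command should look similar to the one below. The exact repository id comes from the model page; the id shown here is an assumption based on how the model used in this tutorial appears on Hugging Face, so substitute the id you see on your model's page:

vllm serve "chuanli11/Llama-3.2-3B-Instruct-uncensored"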

Paste the previously copied command into the Terminal (Figure 11).

Start model in vLLM
Figure 11 - Start model in vLLM

Wait for the loading process to complete. If you've done everything correctly so far, you'll see output like the lines in Figure 12.

Model started
Figure 12 - Model started

Open a new Terminal window. Head back to Hugging Face, but this time, copy the other code block, under # Call the server using curl, and paste it into the new Terminal window.

This is an HTTP POST request containing your prompt. By default, it just says hello to the model. Modify it to your needs, then send it, as depicted in Figure 13.

Send request to model
Figure 13 - Send request to model
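
For reference, the request typically looks like the sketch below. It assumes vLLM's OpenAI-compatible server is listening on its default address, localhost:8000, and the model field must match the repository id you passed to vllm serve:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "chuanli11/Llama-3.2-3B-Instruct-uncensored",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'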

Following a successful request, your windows should look like Figure 14. The original Terminal window logs the successful HTTP request, and the new window shows the model's response.

Answer from model
Figure 14 - Answer from model

Summary

You've learned how to install Python PIP and Miniconda on Linux, set up a virtual environment, and install vLLM. You’ve also explored how to download and run AI models from Hugging Face, and send requests to interact with the model. With this setup, you're now equipped to run AI models locally, giving you more control and flexibility to explore different models and their outputs right from your system.

Can I do this on Windows?

Yes, via Ubuntu on WSL (Windows Subsystem for Linux). Click to find out how to use vLLM on Windows.

More information