How to set up an AI text classification system on Linux

Setting up an AI text classification system on Ubuntu involves several steps, including installing necessary software, configuring the environment, preparing your data, building and training the model, and optionally deploying the model for real-time predictions. Below is a comprehensive, step-by-step guide to help you through the process.

Table of Contents

  1. Prerequisites
  2. Step 1: Update the System
  3. Step 2: Install Python and Essential Tools
  4. Step 3: Set Up a Virtual Environment
  5. Step 4: Install Required Python Libraries
  6. Step 5: (Optional) Install GPU Support
  7. Step 6: Prepare Your Dataset
  8. Step 7: Preprocess the Text Data
  9. Step 8: Build and Train the Classification Model
  10. Step 9: Evaluate the Model
  11. Step 10: Save and Load the Model
  12. Step 11: (Optional) Deploy the Model with Flask
  13. Summary

Prerequisites

Before you begin, ensure you have:

  • Ubuntu OS: This guide is tailored for Ubuntu 20.04 LTS and later versions.
  • Basic Knowledge of Terminal Commands: Familiarity with navigating the terminal and executing commands.
  • Sufficient Hardware: For deep learning tasks, a machine with a GPU (preferably NVIDIA) is recommended to speed up training. Otherwise, CPU-based training is also possible but slower.

Step 1: Update the System

First, update your package lists and upgrade existing packages to ensure your system is up-to-date.

sudo apt update
sudo apt upgrade -y

Step 2: Install Python and Essential Tools

Ubuntu ships with Python 3 pre-installed, but pip and the development headers usually are not, so install them explicitly.

2.1 Install Python 3 and pip

sudo apt install -y python3 python3-pip

2.2 Verify Python and pip Installation

python3 --version
pip3 --version

2.3 Install Additional Build Tools

sudo apt install -y build-essential libssl-dev libffi-dev python3-dev

Step 3: Set Up a Virtual Environment

Using a virtual environment isolates your project dependencies and avoids conflicts.

3.1 Install venv Module

sudo apt install -y python3-venv

3.2 Create a Virtual Environment

Navigate to your project directory or create one:

mkdir ~/text_classification
cd ~/text_classification

Create the virtual environment:

python3 -m venv myenv

Replace myenv with your preferred environment name.

3.3 Activate the Virtual Environment

source myenv/bin/activate

After activation, your terminal prompt will be prefixed with (myenv), indicating that the virtual environment is active. To leave the environment later, run deactivate.

3.4 Upgrade pip Inside the Virtual Environment

pip install --upgrade pip

Step 4: Install Required Python Libraries

Install the necessary libraries for machine learning, natural language processing (NLP), and data handling.

4.1 Install Core Libraries

pip install numpy pandas scikit-learn

4.2 Install NLP Libraries

pip install nltk spacy

4.3 Install Deep Learning Libraries

You can use TensorFlow or PyTorch (or both); this guide installs both so you can follow either path. If you only need one framework, skip the other line.

pip install tensorflow
pip install torch torchvision torchaudio

4.4 Install Transformers Library

pip install transformers

4.5 Install Additional Utilities

pip install matplotlib seaborn

4.6 (Optional) Install Gensim for Word Embeddings

pip install gensim

4.7 Download spaCy Language Model

python -m spacy download en_core_web_sm

4.8 Download NLTK Data

Launch a Python shell:

python

Then, within Python:

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')   # needed by word_tokenize on newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')     # extra WordNet data used by the lemmatizer on newer NLTK releases
exit()
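
Alternatively, you can fetch the same data non-interactively with NLTK's downloader module:

python -m nltk.downloader punkt punkt_tab stopwords wordnet omw-1.4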

Step 5: (Optional) Install GPU Support

If your machine has an NVIDIA GPU and you wish to leverage it for faster training, install NVIDIA drivers and CUDA.

5.1 Check for NVIDIA GPU

lspci | grep -i nvidia

If the output lists an NVIDIA GPU, proceed; otherwise, skip ahead to Step 6.

5.2 Install NVIDIA Drivers

sudo ubuntu-drivers devices

This command lists available drivers. Install the recommended driver:

sudo ubuntu-drivers autoinstall

After installation, reboot your system:

sudo reboot
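
After the reboot, confirm the driver loaded correctly:

nvidia-smi

This should print a table listing your GPU model, the driver version, and current utilization.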

5.3 Install CUDA Toolkit

To install the CUDA Toolkit, visit the CUDA Toolkit download page (https://developer.nvidia.com/cuda-downloads) and follow the instructions for your Ubuntu release; the exact package filename changes between CUDA versions, so prefer the commands generated there.

For example, for CUDA 12.1 on Ubuntu 20.04:

wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda-repo-ubuntu2004-12-1-local_12.1.0-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-1-local_12.1.0-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

After installation, reboot the system:

sudo reboot

5.4 Verify CUDA Installation

After rebooting, verify that CUDA installed correctly by checking the compiler version:

nvcc --version

If nvcc is not found, add /usr/local/cuda/bin to your PATH (for example, in ~/.bashrc).
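
You can also confirm that the frameworks installed in Step 4 detect the GPU. From a Python shell inside your activated virtual environment:

import torch
import tensorflow as tf

# Both checks use the frameworks' public APIs
print("PyTorch sees CUDA:", torch.cuda.is_available())
print("TensorFlow sees GPUs:", tf.config.list_physical_devices('GPU'))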

Step 6: Prepare Your Dataset

The next step is to obtain or create a dataset for your text classification task. The dataset should contain labeled text data in a format like CSV, where each entry has a text field and a label field.

Example format of the dataset:

text,label
"This is a positive example.",positive
"This is a negative example.",negative
...

Ensure the dataset is cleaned and consistently labeled; noisy or inconsistent labels directly limit the accuracy the classifier can reach.
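
Step 8 below assumes the data has been loaded into two Python lists, texts and labels. Here is a minimal loading sketch with pandas, assuming a hypothetical file named dataset.csv with the columns shown above:

import pandas as pd

# Load the labeled dataset (hypothetical filename; adjust to your file)
df = pd.read_csv('dataset.csv')

# Drop rows with a missing text or label
df = df.dropna(subset=['text', 'label'])

texts = df['text'].tolist()
labels = df['label'].tolist()

print(f"Loaded {len(texts)} examples across {df['label'].nunique()} classes")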

Step 7: Preprocess the Text Data

Text data often needs to be preprocessed before it can be fed into a model. Common steps include:

  • Tokenization (splitting text into words or subwords)
  • Lowercasing
  • Removing stop words
  • Lemmatization (reducing words to their base form)

You can use the nltk and spaCy libraries for preprocessing:

Example:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Build these once rather than on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocessing function
def preprocess_text(text):
    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())

    # Keep alphabetic tokens that are not stopwords
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]

    # Lemmatize tokens (reduce them to their base form)
    return [lemmatizer.lemmatize(token) for token in tokens]
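
For example, assuming the NLTK data from Step 4.8 has been downloaded:

print(preprocess_text("The cats are running quickly!"))
# ['cat', 'running', 'quickly']

Note that TfidfVectorizer in Step 8 tokenizes raw strings itself; to feed it this custom preprocessing instead, join the tokens back into a string with ' '.join(preprocess_text(text)).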

Step 8: Build and Train the Classification Model

After preprocessing, you can build and train a classification model. For traditional machine learning models, you can use scikit-learn:

Example with a Logistic Regression Classifier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assuming your text data is in a list `texts` and labels in `labels`;
# random_state makes the split reproducible, stratify preserves class proportions
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a Logistic Regression classifier
# (a higher iteration cap avoids convergence warnings on larger vocabularies)
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

# Evaluate the model
accuracy = model.score(X_test_tfidf, y_test)
print(f"Model Accuracy: {accuracy:.3f}")

Step 9: Evaluate the Model

Once your model is trained, evaluate it using the test dataset to measure its accuracy and other performance metrics such as precision, recall, and F1-score. You can use scikit-learn’s evaluation functions:

Example:

from sklearn.metrics import classification_report

# Generate predictions
y_pred = model.predict(X_test_tfidf)

# Display classification report
print(classification_report(y_test, y_pred))
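
You can also print a confusion matrix to see which classes the model mixes up:

from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels
print(confusion_matrix(y_test, y_pred))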

Step 10: Save and Load the Model

After training, you may want to save the model for later use. You can save both the model and the vectorizer with the joblib package (installed as a dependency of scikit-learn) or Python's built-in pickle module:

Example with joblib:

import joblib

# Save model and vectorizer
joblib.dump(model, 'text_classifier_model.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')

# Load the model and vectorizer later
model = joblib.load('text_classifier_model.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')
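
To confirm the round trip worked, run a quick prediction with the reloaded objects:

sample = vectorizer.transform(["This is a positive example."])
print(model.predict(sample)[0])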

Step 11: (Optional) Deploy the Model with Flask

If you want to deploy your trained model as a web service, you can use Flask to create an API that accepts text input and returns predictions. Flask was not installed in Step 4, so install it inside your virtual environment first:

pip install flask

Example Flask Application:

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load the saved model and vectorizer
model = joblib.load('text_classifier_model.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    text = data.get('text') if data else None
    if not text:
        return jsonify({'error': 'missing "text" field'}), 400

    # Vectorize the input text
    vectorized_text = vectorizer.transform([text])

    # Get the prediction; cast to str so numpy label types serialize cleanly
    prediction = model.predict(vectorized_text)[0]

    return jsonify({'prediction': str(prediction)})

if __name__ == '__main__':
    # debug=True is for development only; use a WSGI server such as gunicorn in production
    app.run(debug=True)
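
Run the application (assuming you saved it as app.py, a hypothetical filename) and test the endpoint with curl from a second terminal:

python app.py

curl -X POST http://127.0.0.1:5000/predict \
     -H "Content-Type: application/json" \
     -d '{"text": "This is a positive example."}'

The development server listens on port 5000 by default and returns a JSON response such as {"prediction": "positive"}.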

Summary

By following the steps outlined above, you can successfully set up an AI-based text classification system on Ubuntu. From system setup and Python environment preparation to building and deploying the model, this guide provides a complete roadmap for creating text classifiers using traditional machine learning or deep learning approaches.