AI Token Classification Setup on Ubuntu
1. Install Python and Necessary Packages
You’ll need Python, pip, and a virtual environment. If you don’t have Python installed, you can install it using the following commands:
sudo apt update
sudo apt install python3 python3-pip python3-venv
2. Set Up a Virtual Environment
It's a good practice to use a virtual environment to manage dependencies. To set it up:
python3 -m venv token-classification-env
source token-classification-env/bin/activate
3. Install Required Libraries
The transformers library from Hugging Face will be used for token classification. Install it along with PyTorch and the datasets library:
pip install torch transformers datasets
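To confirm the installation and check whether PyTorch can see a GPU, you can run a quick sanity check like this (purely illustrative, inside the activated virtual environment):
import torch
import transformers

# Print the installed versions and whether CUDA is available
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())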
4. Prepare Your Dataset
Token classification involves labeling the tokens (words) in a text. For example, to load the conll2003 dataset commonly used for Named Entity Recognition (NER):
from datasets import load_dataset
dataset = load_dataset("conll2003")
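It helps to inspect the dataset before going further. Here is a small sketch that prints one training example and the string names behind the integer ner_tags labels:
# Look at one example: parallel lists of words and integer NER tags
print(dataset["train"][0]["tokens"])
print(dataset["train"][0]["ner_tags"])

# Map the integers back to label names (O, B-PER, I-ORG, ...)
label_list = dataset["train"].features["ner_tags"].feature.names
print(label_list)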
5. Load a Pre-trained Token Classification Model
You can load a pre-trained model, such as a BERT model already fine-tuned for NER on the CoNLL-2003 dataset:
from transformers import AutoTokenizer, AutoModelForTokenClassification
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
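This checkpoint ships with its own label mapping. As a quick check, you can print it to see which entity classes the model predicts:
# The config stores the mapping between class indices and label strings
# (a dict of indices to labels such as 'O', 'B-PER', 'I-ORG', ...)
print(model.config.id2label)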
6. Tokenize the Dataset
Before the dataset can be fed into the model, it needs to be tokenized, and the word-level NER labels need to be aligned with the sub-word tokens the tokenizer produces:
def tokenize_and_align_labels(examples):
    # Tokenize pre-split words so sub-word pieces can be mapped back to their word
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        # Special tokens get -100 so the loss ignores them; sub-words inherit their word's label
        label_ids = [-100 if word_id is None else label[word_id] for word_id in word_ids]
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
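To see what the alignment does, you can compare a tokenized example with its label row; special tokens such as [CLS] and [SEP] receive the label -100, which the loss function ignores:
sample = tokenized_dataset["train"][0]
# Sub-word tokens produced by the tokenizer
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))
# Aligned labels: -100 marks positions that should not contribute to the loss
print(sample["labels"])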
7. Fine-tune the Model (Optional)
If you want to fine-tune the model on your own dataset, you can use the Trainer API from Hugging Face:
from transformers import Trainer, TrainingArguments, DataCollatorForTokenClassification

# Pads both the inputs and the label sequences to the same length within each batch
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
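The setup above only reports the evaluation loss. If you also want entity-level precision, recall, and F1, you can pass a compute_metrics function to the Trainer. Here is a minimal sketch using the seqeval metric from the evaluate library (it assumes you have installed evaluate and seqeval with pip):
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")
label_list = dataset["train"].features["ner_tags"].feature.names

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Strip the -100 positions (special tokens) before scoring
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_predictions = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# Pass compute_metrics=compute_metrics when constructing the Trainer above;
# trainer.evaluate() will then report these scores as well.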
8. Run Inference on New Text
Once the model is trained or loaded, you can use it to classify tokens in new text:
from transformers import pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)
sentence = "Hugging Face is creating a great library!"
ner_results = ner_pipeline(sentence)
for entity in ner_results:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']}")
9. Save the Model
After training or fine-tuning, you can save the model for later use:
trainer.save_model("./ner_model")
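To make the saved directory fully self-contained, it's worth saving the tokenizer alongside the model (recent Trainer versions do this automatically when a tokenizer is passed in). You can then reload both later without retraining:
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Save the tokenizer explicitly so the directory can be reloaded on its own
tokenizer.save_pretrained("./ner_model")

# Later, reload the fine-tuned model and tokenizer from disk
model = AutoModelForTokenClassification.from_pretrained("./ner_model")
tokenizer = AutoTokenizer.from_pretrained("./ner_model")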
10. Serve the Model (Optional)
If you want to serve the model as an API, you can use a lightweight framework like FastAPI. Install FastAPI and Uvicorn:
pip install fastapi uvicorn
Here’s a basic FastAPI setup:
# Save this file as app.py so it matches the uvicorn command below
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
ner_pipeline = pipeline("ner", model=model_name)

@app.post("/predict")
def predict(text: str):
    ner_results = ner_pipeline(text)
    # Cast the numpy scores to plain floats so the response is JSON-serializable
    return {"entities": [{**e, "score": float(e["score"])} for e in ner_results]}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
You can run this API with the following command:
uvicorn app:app --reload
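With the server running locally on port 8000, you can test the endpoint from Python using the requests library (this assumes requests is installed in your environment; text is sent as a query parameter because the endpoint declares it as a plain str):
import requests

# Example local request; adjust the host and port if you changed them
response = requests.post(
    "http://localhost:8000/predict",
    params={"text": "Hugging Face is creating a great library!"},
)
print(response.json())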
Summary
You’ve now set up token classification on Ubuntu with Python and Hugging Face: you can prepare a dataset, load a pre-trained model or fine-tune your own, run inference on new text, and optionally serve the model through an API.