AI Token Classification Setup on Ubuntu

1. Install Python and Necessary Packages

You’ll need Python 3, pip, and the venv module for virtual environments. If they aren’t already installed, install them with:

sudo apt update
sudo apt install python3 python3-pip python3-venv
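
To confirm the tools are on your PATH before moving on, you can check their versions:

python3 --version
pip3 --version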


2. Set Up a Virtual Environment

It's a good practice to use a virtual environment to manage dependencies. To set it up:

python3 -m venv token-classification-env
source token-classification-env/bin/activate
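
While the environment is active, python and pip resolve to the environment’s own copies rather than the system ones; deactivate drops you back out when you’re done:

which python   # should print a path inside token-classification-env/
deactivate     # leave the virtual environment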


3. Install Required Libraries

This guide uses the Hugging Face transformers library for token classification. Inside the activated environment, install it along with PyTorch and the datasets library:

pip install torch transformers datasets
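
The command above pulls the default PyTorch build. On a machine without an NVIDIA GPU you can save disk space by installing the CPU-only wheels from PyTorch’s official CPU wheel index instead:

pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install transformers datasets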


4. Prepare Your Dataset

Token classification assigns a label to each token (word or subword) in a text. For example, to load the CoNLL-2003 dataset (conll2003), a standard benchmark for Named Entity Recognition (NER):

from datasets import load_dataset

dataset = load_dataset("conll2003")
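
It’s worth inspecting one example and the tag vocabulary before going further; ner_tags holds integer class ids, and the matching label names are stored in the dataset’s features:

# Peek at one training example and the label names behind the integer tags.
example = dataset["train"][0]
print(example["tokens"])
print(example["ner_tags"])

label_list = dataset["train"].features["ner_tags"].feature.names
print(label_list)  # ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']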


5. Load a Pre-trained Token Classification Model

You can load a model such as BERT that has already been fine-tuned for token classification on CoNLL-2003:

from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
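
The model’s config records how class indices map to label strings, which is worth checking against your dataset’s label order before any fine-tuning:

print(model.config.num_labels)  # 9 classes for a CoNLL-2003 NER model
print(model.config.id2label)    # mapping from class index to tag name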


6. Tokenize the Dataset

Before the dataset can be fed to the model, the text must be tokenized and the word-level labels aligned with the resulting subword tokens. Special tokens get a label of -100 so the loss ignores them, and each subword inherits the label of the word it came from:

def tokenize_and_align_labels(examples):
    # Tokenize pre-split words; is_split_into_words=True preserves the word <-> token mapping.
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        # Special tokens have no word id and get -100 (ignored by the loss);
        # every subword token inherits the label of the word it belongs to.
        label_ids = [-100 if word_id is None else label[word_id] for word_id in word_ids]
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
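
To see what the alignment produced, compare the tokens of one processed example against its labels:

sample = tokenized_dataset["train"][0]
tokens = tokenizer.convert_ids_to_tokens(sample["input_ids"])
for token, label in zip(tokens, sample["labels"]):
    print(token, label)  # special tokens such as [CLS] and [SEP] show -100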


7. Fine-tuning the Model (Optional)

If you want to fine-tune the model on your own data, you can use the Trainer API from Hugging Face. A DataCollatorForTokenClassification is needed so that the label sequences are padded along with the inputs. (The checkpoint loaded above is already fine-tuned on CoNLL-2003; to train from scratch you would typically start from a base checkpoint such as bert-base-cased with num_labels set to the number of tags.)

from transformers import Trainer, TrainingArguments, DataCollatorForTokenClassification

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    # Pads the label sequences together with the inputs; the default collator
    # cannot handle the variable-length "labels" lists.
    data_collator=DataCollatorForTokenClassification(tokenizer),
)

trainer.train()
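
If you also want entity-level precision, recall, and F1 reported during evaluation, you can pass a compute_metrics function to the Trainer. The sketch below follows the usual seqeval-based recipe; it assumes the extra evaluate and seqeval packages (pip install evaluate seqeval) and the label_list taken from the dataset features:

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")
label_list = dataset["train"].features["ner_tags"].feature.names

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Drop the -100 positions (special tokens and padding) before scoring.
    true_labels = [
        [label_list[l] for l in label if l != -100]
        for label in labels
    ]
    true_predictions = [
        [label_list[p] for p, l in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
    }

Pass compute_metrics=compute_metrics when constructing the Trainer above to have these metrics logged at every evaluation.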


8. Run Inference on New Text

Once the model is trained or loaded, you can use it to classify tokens in new text:

from transformers import pipeline

ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

sentence = "Hugging Face is creating a great library!"
ner_results = ner_pipeline(sentence)

for entity in ner_results:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']}")
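
The plain "ner" pipeline returns one prediction per subword token. To merge subwords into whole entities, pass an aggregation strategy; note that the grouped results use the entity_group key instead of entity:

grouped_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

for entity in grouped_pipeline("Hugging Face is based in New York City."):
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.4f}")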


9. Saving the Model

After training or fine-tuning, you can save the model and tokenizer for later use:

trainer.save_model("./ner_model")
tokenizer.save_pretrained("./ner_model")  # keeps the tokenizer files next to the model weights
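
The saved directory can later be reloaded exactly like a Hub checkpoint:

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("./ner_model")
model = AutoModelForTokenClassification.from_pretrained("./ner_model")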


10. Serve the Model (Optional)

If you want to serve the model as an API, you can use a lightweight framework like FastAPI. Install FastAPI and Uvicorn:

pip install fastapi uvicorn


Here’s a basic FastAPI setup (save it as app.py so the run command below works):

from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# Load the pipeline once at startup. Point model at "./ner_model" instead to
# serve your own fine-tuned checkpoint.
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
ner_pipeline = pipeline("ner", model=model_name)

@app.post("/predict")
def predict(text: str):
    # `text` arrives as a query parameter, e.g. POST /predict?text=...
    ner_results = ner_pipeline(text)
    # Cast numpy scores to plain floats so the response serializes to JSON.
    for entity in ner_results:
        entity["score"] = float(entity["score"])
    return {"entities": ner_results}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)


With the code saved as app.py, you can run the API with:

uvicorn app:app --reload
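
Because the endpoint reads text from a query parameter, a quick test from another terminal looks like this (adjust the host and port if you changed them):

curl -X POST "http://localhost:8000/predict?text=Hugging%20Face%20is%20creating%20a%20great%20library"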


Summary

You’ve now set up token classification on Ubuntu with Python and Hugging Face: installing the tooling, preparing a dataset, loading a pre-trained model or fine-tuning your own, running inference, and optionally serving the model through an API.