Translation Model Setup on Ubuntu

1. Install Python and Required Tools

First, ensure that Python 3, pip, and the venv module are installed on your system. You can install them using the following commands:

sudo apt update
sudo apt install python3 python3-pip python3-venv
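
To verify the installation, you can print the installed versions:

python3 --version
pip3 --version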

2. Create and Activate a Virtual Environment

It is recommended to create a virtual environment to manage your project dependencies. Run the following commands to create and activate the environment:

python3 -m venv translation-env
source translation-env/bin/activate
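
Your shell prompt should now show the environment name. When you are finished working, you can leave the environment with:

deactivate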

3. Install TensorFlow and Hugging Face Transformers

Next, install TensorFlow, the Hugging Face Transformers library, and SentencePiece (which the Marian tokenizer requires) with the following command:

pip install tensorflow transformers sentencepiece
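
To confirm that TensorFlow imports correctly, you can run a quick check:

python3 -c "import tensorflow as tf; print(tf.__version__)"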

4. Load a Pre-trained Translation Model

For translation tasks, we can use a pre-trained model such as MarianMT or T5. Here, we will use a MarianMT model from Hugging Face, which supports translation between many language pairs. Since this guide uses TensorFlow, we load the TF variant of the model class:

from transformers import TFMarianMTModel, MarianTokenizer

# Load the pre-trained MarianMT model and tokenizer.
# TFMarianMTModel is the TensorFlow variant, matching the TF tensors used below;
# pass from_pt=True to from_pretrained() if a checkpoint only ships PyTorch weights.
model_name = "Helsinki-NLP/opus-mt-en-de"  # English-to-German model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = TFMarianMTModel.from_pretrained(model_name)
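
Models for other language pairs follow the same naming scheme; for example, Helsinki-NLP/opus-mt-en-fr translates English to French.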

5. Prepare Input Text for Translation

Now, define the text that you want to translate. The input must be tokenized before it is passed to the model:

# Define the text to be translated
input_text = ["Hello, how are you?", "This is a translation example."]

# Tokenize the input text
inputs = tokenizer(input_text, return_tensors="tf", padding=True, truncation=True)
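
You can inspect the tokenized batch to confirm its padded shape:

print(inputs["input_ids"].shape)  # (batch_size, sequence_length)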

6. Perform Translation Using TensorFlow

Pass the tokenized input to the translation model to get the translated text:

# Generate translation
translated = model.generate(**inputs)

# Decode the translated text
translated_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
print(translated_text)
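
If you want more control over decoding, generate() also accepts standard arguments such as num_beams and max_length; for example:

# Beam search with a cap on output length
translated = model.generate(**inputs, num_beams=4, max_length=128)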

7. Example Output

After running the code above, you should see output similar to the following:

['Hallo, wie geht es Ihnen?', 'Dies ist ein Übersetzungsbeispiel.']

8. Fine-tuning the Translation Model (Optional)

If you want to fine-tune the MarianMT model on a custom dataset, you can use Keras' Model.fit(), since we are working with the TensorFlow model class (Hugging Face's Trainer API is the equivalent for the PyTorch classes). The following example shows the workflow; the dataset name and column names are placeholders for your own data:

import tensorflow as tf
from datasets import load_dataset

# Load a custom dataset ("your_dataset_name" is a placeholder)
dataset = load_dataset("your_dataset_name")

# Tokenize source and target text; the column names below are placeholders
# for however your dataset stores its translation pairs
def preprocess_function(examples):
    inputs = [ex["source_text"] for ex in examples["translation"]]
    targets = [ex["target_text"] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")
    labels = tokenizer(text_target=targets, max_length=128, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

encoded_dataset = dataset.map(preprocess_function, batched=True)

# Convert the tokenized split into a tf.data.Dataset that Model.fit() accepts
train_set = model.prepare_tf_dataset(
    encoded_dataset["train"], batch_size=8, shuffle=True, tokenizer=tokenizer
)

# Hugging Face TF models compute their loss internally from the "labels"
# column, so no loss argument is needed here
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5))

model.fit(train_set, epochs=3)
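
If your dataset also has a validation split, you can pass it to Model.fit() as well (a sketch, assuming a split named "validation"):

val_set = model.prepare_tf_dataset(
    encoded_dataset["validation"], batch_size=8, shuffle=False, tokenizer=tokenizer
)
model.fit(train_set, validation_data=val_set, epochs=3)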

9. Save the Model

After fine-tuning the model, you can save it for future use:

# Save the fine-tuned model and tokenizer to the same directory,
# so both can later be reloaded from a single path
model.save_pretrained("./translation-model")
tokenizer.save_pretrained("./translation-model")
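
To reload the saved model and tokenizer later, point from_pretrained() at the same directory:

model = TFMarianMTModel.from_pretrained("./translation-model")
tokenizer = MarianTokenizer.from_pretrained("./translation-model")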

10. Deploy the Model for Inference

If you want to deploy the model for translation inference, you can serve it as an API using FastAPI. First, install FastAPI and Uvicorn:

pip install fastapi uvicorn

Then, create a simple FastAPI app to serve the translation model:

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import TFMarianMTModel, MarianTokenizer

app = FastAPI()

# Load the translation model once at startup
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = TFMarianMTModel.from_pretrained(model_name)

# Request body schema, so the text arrives as JSON rather than a query parameter
class TranslationRequest(BaseModel):
    text: str

@app.post("/translate")
def translate(request: TranslationRequest):
    # Tokenize the input and generate the translation
    inputs = tokenizer([request.text], return_tensors="tf", padding=True)
    translated = model.generate(**inputs)
    translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)
    return {"translated_text": translated_text}

# Allow running the server directly with `python app.py`
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Assuming the file is saved as app.py, you can run the API server with:

uvicorn app:app --reload
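
Once the server is running, you can test the endpoint with curl (the JSON body matches the TranslationRequest model defined above):

curl -X POST http://localhost:8000/translate \
     -H "Content-Type: application/json" \
     -d '{"text": "Hello, how are you?"}'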