Setting Up Sentence Similarity on Linux

This guide walks you through setting up sentence similarity on an Ubuntu system using TensorFlow and Hugging Face’s transformers library. By the end, you'll be able to compute the semantic similarity between two sentences using pre-trained models such as BERT or the Universal Sentence Encoder (USE).

1. Install System Prerequisites

Before you begin, make sure your system is up-to-date and has essential tools installed. Open a terminal and run the following commands:

sudo apt update
sudo apt upgrade
sudo apt install python3 python3-pip python3-venv build-essential
    

This installs Python 3, pip, the venv module, and the build tools that some Python packages need during installation.

2. Create a Virtual Environment (Optional)

It's recommended to use a virtual environment to manage dependencies separately. Run the following commands to create and activate a Python virtual environment:

python3 -m venv sentence-sim-env
source sentence-sim-env/bin/activate
    

This creates and activates a virtual environment named sentence-sim-env; your shell prompt changes to show its name, and you can leave it later with deactivate.
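
To confirm that the environment is active, you can check which interpreter Python resolves to; with the venv activated, both paths should point inside the sentence-sim-env directory:

import sys

# With the virtual environment active, these paths live under sentence-sim-env
print(sys.prefix)
print(sys.executable)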

3. Install TensorFlow and Hugging Face Transformers

Next, install TensorFlow, TensorFlow Hub, and the Hugging Face transformers library, which together provide the pre-trained models used for sentence similarity:

pip install tensorflow
pip install tensorflow-hub
pip install transformers
    

This installs TensorFlow, TensorFlow Hub (needed for the Universal Sentence Encoder example in step 4), and the Hugging Face Transformers library (needed for the BERT example in step 5).
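
As a quick sanity check, you can confirm that the libraries import correctly and print their versions (the exact version numbers depend on when you install):

import tensorflow as tf
import tensorflow_hub as hub
import transformers

# Print the installed versions to confirm the setup works
print("TensorFlow:", tf.__version__)
print("TensorFlow Hub:", hub.__version__)
print("Transformers:", transformers.__version__)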

4. Choose a Pre-Trained Model for Sentence Similarity

You can use a pre-trained model such as BERT or the Universal Sentence Encoder (USE) to compute sentence similarity. Below is an example using the Universal Sentence Encoder from TensorFlow Hub; the first run downloads and caches the model, so it may take a moment:

import tensorflow as tf
import tensorflow_hub as hub
import numpy as np

# Load the Universal Sentence Encoder model from TensorFlow Hub
model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Example sentences
sentence_1 = "The cat sits on the mat."
sentence_2 = "A cat is sitting on a mat."

# Embed the sentences using the Universal Sentence Encoder
embeddings = model([sentence_1, sentence_2])

# Compute cosine similarity between the two sentence embeddings
# (normalize explicitly so the result is a true cosine similarity)
emb_1, emb_2 = embeddings[0].numpy(), embeddings[1].numpy()
cosine_similarity = np.dot(emb_1, emb_2) / (np.linalg.norm(emb_1) * np.linalg.norm(emb_2))
print(f"Cosine Similarity: {cosine_similarity:.4f}")

    

This example uses the Universal Sentence Encoder to embed the two sentences and computes their similarity with the cosine similarity metric. Cosine similarity ranges from -1 to 1; the closer the score is to 1, the more similar the sentences are in meaning.

5. Using BERT for Sentence Similarity

Alternatively, you can use BERT from the Hugging Face library to compute sentence similarity. Here’s an example:

from transformers import BertTokenizer, TFBertModel
import tensorflow as tf
import numpy as np

# Load the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

# Example sentences
sentence_1 = "The cat sits on the mat."
sentence_2 = "A cat is sitting on a mat."

# Tokenize and encode the sentences
inputs_1 = tokenizer(sentence_1, return_tensors='tf')
inputs_2 = tokenizer(sentence_2, return_tensors='tf')

# Use the [CLS] token embedding as a sentence representation
embeddings_1 = model(**inputs_1).last_hidden_state[:, 0, :].numpy()[0]
embeddings_2 = model(**inputs_2).last_hidden_state[:, 0, :].numpy()[0]

# Compute cosine similarity (normalize, since BERT embeddings are not unit-length)
cosine_similarity = np.dot(embeddings_1, embeddings_2) / (
    np.linalg.norm(embeddings_1) * np.linalg.norm(embeddings_2)
)
print(f"Cosine Similarity (BERT): {cosine_similarity:.4f}")

    

This example loads a pre-trained BERT model, represents each sentence by its [CLS] token embedding, and calculates their similarity using cosine similarity. BERT captures sentence semantics by processing each word in the context of the words around it.
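
Note that a plain pre-trained BERT is not fine-tuned for similarity, and its [CLS] vector often gives weaker scores than averaging the token embeddings. The sketch below shows a mean-pooling variant; the helper name mean_pool is just for illustration and reuses the model, inputs_1, and inputs_2 objects from the script above:

import numpy as np

def mean_pool(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions via the attention mask
    token_embeddings = model_output.last_hidden_state.numpy()[0]   # (seq_len, hidden)
    mask = np.expand_dims(attention_mask.numpy()[0], axis=-1)      # (seq_len, 1)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

pooled_1 = mean_pool(model(**inputs_1), inputs_1['attention_mask'])
pooled_2 = mean_pool(model(**inputs_2), inputs_2['attention_mask'])

cosine = np.dot(pooled_1, pooled_2) / (np.linalg.norm(pooled_1) * np.linalg.norm(pooled_2))
print(f"Cosine Similarity (mean-pooled BERT): {cosine:.4f}")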

6. Run Sentence Similarity with Custom Inputs

You can test sentence similarity with your own sentences: simply replace the example sentences in either of the previous scripts with any two sentences you want to compare:

sentence_1 = "Artificial Intelligence is revolutionizing the world."
sentence_2 = "AI is changing the future of technology."
    

Rerun the script with these new sentences to calculate their similarity score, or wrap the comparison in a reusable function as sketched below.
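
For repeated comparisons, it can be convenient to wrap the Universal Sentence Encoder logic in a small helper. This is a minimal sketch, assuming the tensorflow-hub package from step 3 is installed; the function name sentence_similarity is just for illustration:

import numpy as np
import tensorflow_hub as hub

# Load the model once and reuse it for every comparison
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def sentence_similarity(sentence_a, sentence_b):
    # Embed both sentences in one batch and return their cosine similarity
    emb = use_model([sentence_a, sentence_b]).numpy()
    return float(np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])))

print(sentence_similarity(
    "Artificial Intelligence is revolutionizing the world.",
    "AI is changing the future of technology.",
))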

7. Optimize for Performance (Optional)

If you have an NVIDIA GPU, TensorFlow can use it to speed up inference. With TensorFlow 2.x the standard tensorflow package already includes GPU support (the separate tensorflow-gpu package is deprecated), so you mainly need a compatible NVIDIA driver plus the CUDA and cuDNN libraries. On recent TensorFlow releases you can also have pip pull in the CUDA libraries for you:

pip install 'tensorflow[and-cuda]'

Then, check if TensorFlow detects your GPU:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
    

If a GPU is available, TensorFlow will automatically use it for faster inference.
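
By default, TensorFlow reserves most of the GPU memory up front. If you share the GPU with other processes, you can optionally enable memory growth so TensorFlow allocates memory only as needed; a minimal sketch:

import tensorflow as tf

# Enable on-demand memory allocation for each detected GPU
# (this must run before any GPU operation initializes the device)
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)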

8. Conclusion

By following these steps, you have set up a sentence similarity workflow on Ubuntu using TensorFlow and Hugging Face Transformers. You can use either BERT or the Universal Sentence Encoder to compute the semantic similarity between sentences, and this setup can be applied to tasks such as text matching, paraphrase detection, and semantic search.