I am making a project like this one here: https://www.youtube.com/watch?v=dovB8uSUUXE&feature=youtu.be but I am running into trouble because I need to check the similarity between sentences. For example, if the user said 'the person wear red T-shirt' instead of 'the boy wear red T-shirt', I want a method to check the similarity between these two sentences without having to check the similarity between each word. Is there a way to do this in Python?
In short, I am trying to find a way to check the similarity between two sentences.
A simple algorithm for document similarity consists of three fundamental steps: split the documents into words, compute the word frequencies, and calculate the dot product of the resulting document vectors.
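A minimal sketch of those three steps in plain Python (using a normalised dot product, i.e. the cosine, so the score lands between 0 and 1):
Code:
from collections import Counter
import math

def word_counts(text):
    # steps 1 and 2: split into words and count word frequencies
    return Counter(text.lower().split())

def frequency_cosine(a, b):
    # step 3: dot product of the two frequency vectors,
    # normalised by their lengths so the result is a cosine
    ca, cb = word_counts(a), word_counts(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

print(frequency_cosine('the person wear red T-shirt',
                       'the boy wear red T-shirt'))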
The easiest way of estimating the semantic similarity between a pair of sentences is to take the average of the word embeddings of all words in each sentence, giving the document centroid vector, and then calculate the cosine between the two resulting vectors.
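Here is a minimal sketch of that centroid-plus-cosine idea. The word vectors below are made-up toy values purely for illustration; in practice they would come from a pretrained embedding model:
Code:
import numpy as np

# toy 3-dimensional word vectors (hypothetical values for illustration only)
word_vectors = {
    "the":     np.array([0.1, 0.3, 0.2]),
    "person":  np.array([0.5, 0.1, 0.4]),
    "boy":     np.array([0.4, 0.2, 0.4]),
    "wear":    np.array([0.2, 0.6, 0.1]),
    "red":     np.array([0.7, 0.1, 0.1]),
    "t-shirt": np.array([0.3, 0.4, 0.5]),
}

def centroid(sentence):
    # the document centroid: average of the word vectors
    return np.mean([word_vectors[w] for w in sentence.lower().split()], axis=0)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(centroid("the person wear red T-shirt"),
             centroid("the boy wear red T-shirt")))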
Most of the libraries below should be a good choice for semantic similarity comparison. You can skip direct word comparison by generating word or sentence vectors with pretrained models from these libraries.
spaCy
Required models must be downloaded first. To use en_core_web_md, download it with python -m spacy download en_core_web_md; to use en_core_web_lg, download it with python -m spacy download en_core_web_lg.
The large model is around ~830 MB as of writing and quite slow, so the medium one can be a good choice.
https://spacy.io/usage/vectors-similarity/
Code:
import spacy
nlp = spacy.load("en_core_web_lg")
#nlp = spacy.load("en_core_web_md")
doc1 = nlp(u'the person wear red T-shirt')
doc2 = nlp(u'this person is walking')
doc3 = nlp(u'the boy wear red T-shirt')
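# Doc.similarity compares the document vectors (the average of the token vectors) using cosine similarity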
print(doc1.similarity(doc2))
print(doc1.similarity(doc3))
print(doc2.similarity(doc3))
Output:
0.7003971105290047
0.9671912343259517
0.6121211244876517
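Note how doc1 and doc3 ('the person ...' vs 'the boy ...') score highest (~0.97), which is exactly the kind of match the question asks about.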
Sentence Transformers
https://github.com/UKPLab/sentence-transformers
https://www.sbert.net/docs/usage/semantic_textual_similarity.html
Install with pip install -U sentence-transformers. This one generates sentence embeddings.
Code:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
sentences = [
'the person wear red T-shirt',
'this person is walking',
'the boy wear red T-shirt'
]
sentence_embeddings = model.encode(sentences)
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
Output:
Sentence: the person wear red T-shirt
Embedding: [ 1.31643847e-01 -4.20616418e-01 ... 8.13076794e-01 -4.64620918e-01]
Sentence: this person is walking
Embedding: [-3.52878094e-01 -5.04286848e-02 ... -2.36091137e-01 -6.77282438e-02]
Sentence: the boy wear red T-shirt
Embedding: [-2.36365378e-01 -8.49713564e-01 ... 1.06414437e+00 -2.70157874e-01]
Now the embedding vectors can be used to calculate various similarity metrics.
Code:
from sentence_transformers import SentenceTransformer, util
print(util.pytorch_cos_sim(sentence_embeddings[0], sentence_embeddings[1]))
print(util.pytorch_cos_sim(sentence_embeddings[0], sentence_embeddings[2]))
print(util.pytorch_cos_sim(sentence_embeddings[1], sentence_embeddings[2]))
Output:
tensor([[0.4644]])
tensor([[0.9070]])
tensor([[0.3276]])
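As a side note, util.pytorch_cos_sim also accepts whole matrices, so all pairwise scores can be computed in one call (reusing sentence_embeddings from above):
Code:
from sentence_transformers import util
# one call produces the full 3x3 matrix of pairwise cosine similarities
print(util.pytorch_cos_sim(sentence_embeddings, sentence_embeddings))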
The same thing can be done with scipy and pytorch; all three approaches compute the same cosine metric, so the values match.
Code:
from scipy.spatial import distance
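# distance.cosine returns cosine distance, so 1 - distance gives cosine similarity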
print(1 - distance.cosine(sentence_embeddings[0], sentence_embeddings[1]))
print(1 - distance.cosine(sentence_embeddings[0], sentence_embeddings[2]))
print(1 - distance.cosine(sentence_embeddings[1], sentence_embeddings[2]))
Output:
0.4643629193305969
0.9069876074790955
0.3275738060474396
Code:
import torch.nn
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
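# encode() returned a numpy array, so convert it to a torch tensor first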
b = torch.from_numpy(sentence_embeddings)
print(cos(b[0], b[1]))
print(cos(b[0], b[2]))
print(cos(b[1], b[2]))
Output:
tensor(0.4644)
tensor(0.9070)
tensor(0.3276)
TFHub Universal Sentence Encoder
https://tfhub.dev/google/universal-sentence-encoder/4
https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb
The model for this one is very large, around 1 GB, and seems slower than the others. It also generates embeddings for sentences.
Code:
import tensorflow_hub as hub
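# the model (~1 GB) is downloaded and cached on the first call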
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed([
"the person wear red T-shirt",
"this person is walking",
"the boy wear red T-shirt"
])
print(embeddings)
Output:
tf.Tensor(
[[ 0.063188 0.07063895 -0.05998802 ... -0.01409875 0.01863449
0.01505797]
[-0.06786212 0.01993554 0.03236153 ... 0.05772103 0.01787272
0.01740014]
[ 0.05379306 0.07613157 -0.05256693 ... -0.01256405 0.0213196
-0.00262441]], shape=(3, 512), dtype=float32)
Code:
from scipy.spatial import distance
print(1 - distance.cosine(embeddings[0], embeddings[1]))
print(1 - distance.cosine(embeddings[0], embeddings[2]))
print(1 - distance.cosine(embeddings[1], embeddings[2]))
Output:
0.15320375561714172
0.8592830896377563
0.09080004692077637
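For the use case in the question, the scores from any of these models can be turned into a decision with a simple threshold. This is a hypothetical helper, and the 0.8 threshold is an arbitrary assumption that should be tuned on real examples:
Code:
from scipy.spatial import distance

# hypothetical helper: given the embedding of what the user said and the
# embeddings of the expected sentences, return the index of the closest
# expected sentence if it clears the (arbitrary, tunable) threshold
def best_match(candidate_vec, reference_vecs, threshold=0.8):
    sims = [1 - distance.cosine(candidate_vec, ref) for ref in reference_vecs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best if sims[best] >= threshold else None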
Some other libraries and resources worth exploring:
https://github.com/facebookresearch/InferSent
https://github.com/Tiiiger/bert_score
How to compute the similarity between two text documents?
https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity
https://towardsdatascience.com/word-distance-between-word-embeddings-cc3e9cf1d632
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cosine.html
https://www.tensorflow.org/api_docs/python/tf/keras/losses/CosineSimilarity
https://nlp.town/blog/sentence-similarity/