I want to compute how similar two arbitrary sentences are to each other. For example:
- A mathematician found a solution to the problem.
- The problem was solved by a young mathematician.
I can use a tagger, a stemmer, and a parser, but I don't know how to detect that these sentences are similar.
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them: similarity = (A · B) / (||A|| × ||B||).
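As a minimal sketch, that formula translates directly into Python; the example vectors below are made up purely for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b:
    (A . B) / (||A|| * ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical word-occurrence vectors for two sentences
a = [1, 1, 0, 1]
b = [1, 0, 1, 1]
print(cosine_similarity(a, b))  # 2 / 3 ≈ 0.667
```

Identical vectors give 1.0; vectors with no shared non-zero components give 0.0.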
These two sentences are not just similar, they are almost paraphrases, i.e., two alternative ways of expressing the same meaning. It is also a very simple case of paraphrase, in which both utterances use the same words, the only difference being that one is in the active voice while the other is passive. (The two sentences are not exactly paraphrases, because in the second sentence the mathematician is "young". This additional information makes the semantic relation between the two sentences non-symmetric. In these cases, you would say that the second utterance "entails" the first one, or in other words that the first can be inferred from the second.)
From the example it is not possible to understand whether you are actually interested in paraphrase detection, textual entailment or in sentence similarity in general, which is an even broader and fuzzier problem. For example, is "people eat food" more similar to "people eat bread" or to "men eat food"?
Both paraphrase detection and text similarity are complex, open research problems in Natural Language Processing, with a large and active community of researchers working on them. It is not clear how far your interest in this topic extends, but consider that even though many brilliant researchers have spent, and still spend, their whole careers trying to crack it, we are still very far from finding sound solutions that just work in general.
Unless you are interested in a very superficial solution that would only work in specific cases and that would not capture syntactic alternation (as in this case), I would suggest that you look into the problem of text similarity in more depth. A good starting point would be the book "Foundations of Statistical Natural Language Processing", which provides a very well organised presentation of most statistical natural language processing topics. Once you have clarified your requirements (e.g., under what conditions is your method supposed to work? what levels of precision/recall are you after? what kinds of phenomena can you safely ignore, and which ones do you need to account for?) you can start looking into specific approaches by diving into recent research work. Here, a good place to start would be the online archives of the Association for Computational Linguistics (ACL), which is the publisher of most research results in the field.
Just to give you something practical to work with, a very rough baseline for sentence similarity would be the cosine similarity between two binary vectors representing the sentences as bags of words. A bag of words is a very simplified representation of text, commonly used for information retrieval, in which you completely disregard syntax and represent a sentence as a vector whose size is the size of the vocabulary (i.e., the number of words in the language) and whose component "i" is valued "1" if the word at position "i" in the vocabulary appears in the sentence, and "0" otherwise.
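This baseline can be sketched in a few lines of Python. As a simplifying assumption, the vocabulary here is built from just the two example sentences rather than a full lexicon, and tokenization is a naive regex; the cosine value you get is only meaningful relative to other pairs scored the same way:

```python
import math
import re

def bag_of_words(sentence, vocabulary):
    """Binary bag-of-words vector: 1 if the vocabulary word
    occurs in the sentence, 0 otherwise."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return [1 if w in words else 0 for w in vocabulary]

def cosine(a, b):
    # (A . B) / (||A|| * ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

s0 = "A mathematician found a solution to the problem."
s1 = "The problem was solved by a young mathematician."

# Joint vocabulary of both sentences (a stand-in for a real lexicon)
vocab = sorted(set(re.findall(r"[a-z]+", (s0 + " " + s1).lower())))
v0 = bag_of_words(s0, vocab)
v1 = bag_of_words(s1, vocab)
print(cosine(v0, v1))  # ≈ 0.53 for this pair
```

Note how the active/passive alternation is invisible to this representation: the score comes entirely from the four shared word types ("a", "the", "mathematician", "problem"), which is exactly the superficiality warned about above.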
A more modern approach (as of 2021) is to use a Machine Learning NLP model. There are pre-trained models for exactly this task, many of them derived from BERT, so you don't have to train your own model (though you could if you wanted to). Here is a code example that uses the excellent Huggingface Transformers library with PyTorch. It's based on this example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# BERT fine-tuned on MRPC (Microsoft Research Paraphrase Corpus)
model_name = "bert-base-cased-finetuned-mrpc"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

sequence_0 = "A mathematician found a solution to the problem."
sequence_1 = "The problem was solved by a young mathematician."

# Encode the sentence pair and classify it
tokens = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")
classification_logits = model(**tokens)[0]
results = torch.softmax(classification_logits, dim=1).tolist()[0]

classes = ["not paraphrase", "is paraphrase"]
for i in range(len(classes)):
    print(f"{classes[i]}: {round(results[i] * 100)}%")
```