I have a collection of sentences, and I need to analyse them to see how similar they are.
Are there any established algorithms to do this?
I care about word-level similarity rather than spelling: naively, I don't care about spelling differences, and typos can be treated as different words, although it would be nice to account for them eventually.

I've used Levenshtein distance and n-grams for spelling correction before, although I'm not confident those translate directly to my purposes. Perhaps some hybrid of splitting the sentence at spaces and one of the above (or other) algorithms would be a starting point.
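The hybrid just described — splitting sentences at spaces and running Levenshtein over the resulting word tokens — can be sketched roughly like this (a toy implementation for illustration; function names are my own, not from any established library):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance. It works on any
    # sequences, so passing lists of words gives a word-level distance
    # instead of the usual character-level one.
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def word_similarity(s1, s2):
    # Split at spaces, compute word-level edit distance, and
    # normalise into a 0..1 similarity score.
    w1, w2 = s1.split(), s2.split()
    d = levenshtein(w1, w2)
    return 1.0 - d / max(len(w1), len(w2), 1)
```

With this scheme a typo produces a whole-word mismatch (cost 1), matching the "typos can be treated as different words" assumption; a fancier variant could substitute character-level Levenshtein for the 0/1 substitution cost to account for near-misses.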
What options are available? Any advice?
Thanks!
The simplest way to compute the similarity between two documents using word embeddings is to compute the document centroid vector — the average of all the word vectors in the document — and then compare centroids with a measure such as cosine similarity.
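A minimal sketch of the centroid approach, using made-up three-dimensional toy vectors in place of real pretrained embeddings (in practice you would load something like GloVe or word2vec):

```python
import math

# Toy word vectors for illustration only; real embeddings would have
# hundreds of dimensions and come from a pretrained model.
VECS = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
}

def centroid(sentence):
    # Average the vectors of the words we have embeddings for;
    # out-of-vocabulary words are simply skipped.
    vecs = [VECS[w] for w in sentence.split() if w in VECS]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    # Cosine similarity: dot product over the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

With these toy vectors, `cosine(centroid("the dog"), centroid("the cat"))` comes out higher than `cosine(centroid("the dog"), centroid("the car"))`, which is the behaviour the centroid trick is meant to capture. Note that averaging discards word order, so it is a bag-of-words measure.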
The books share a similarity of ideas. I see a lot of similarities in them. Looking at these fossils, I see some similarity to modern-day birds. I see very little similarity between your situation and his.
This paper compares several sentence similarity measures. Perhaps you can use one of them as is, or modify it for your needs.
Otherwise, "sentence similarity measure" is a good key term to google for.