I'm looking for an algorithm that can compare two text messages (let's say forum posts) and report their similarity as a percentage.
What would be the most efficient solution for this purpose?
The idea is to use this algorithm to identify users on a forum who have more than one nickname and pretend to be different people.
I'm going to build a program that will read all their posts and compare each post from the first account to the posts of the second account, to find out whether they are genuinely two different people or just two registrations of a single user.
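The pairwise comparison described above could be sketched as follows. This is only a starting point, not a recommended method: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, and the names `similarity_percent` and `best_matches` as well as the sample posts are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity_percent(a: str, b: str) -> float:
    """Return a 0-100 score based on matching character runs (case-insensitive)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matches(posts_a: list[str], posts_b: list[str]) -> list[tuple[float, str, str]]:
    """Score every cross-account pair of posts and return them, most similar first."""
    scored = [(similarity_percent(a, b), a, b) for a, b in product(posts_a, posts_b)]
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical posts from two accounts.
    account_1 = ["I think the new update is terrible", "Anyone tried the beta yet?"]
    account_2 = ["i think the new update is terrible!", "Great weather today"]
    for score, a, b in best_matches(account_1, account_2)[:3]:
        print(f"{score:5.1f}%  {a!r}  <->  {b!r}")
```

Note that this compares every post of one account with every post of the other, so the number of comparisons grows with the product of the two post counts.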
Similarity is calculated by measuring the cosine of the angle between two vectors [8]. Cosine similarity is advantageous because two similar documents can be far apart in Euclidean distance simply because one is much longer than the other, while the angle between their vectors stays small.
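A minimal sketch of cosine similarity over plain word counts, to make the point about document length concrete. The tokenizer is a deliberately crude assumption, and `term_counts` and `cosine_similarity` are illustrative names, not part of any library.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Lowercase the text and count word tokens (a deliberately simple tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty post has no direction to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    short_post = "cats chase mice"
    long_post = " ".join([short_post] * 3)  # same wording, three times longer
    # The counts differ (so the Euclidean distance is large),
    # but both vectors point in the same direction.
    print(cosine_similarity(term_counts(short_post), term_counts(long_post)))
```

Multiplying the result by 100 gives a percentage-style score.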
The first thing that came to my mind was the Levenshtein distance, but it is better suited to comparing individual words than whole posts.
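For reference, here is a small sketch of the Levenshtein (edit) distance using the standard two-row dynamic programming approach. The `levenshtein_percent` helper, which normalizes by the longer string's length, is one possible way to turn the distance into a percentage and is an assumption of this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_percent(a: str, b: str) -> float:
    """Similarity as a percentage, normalized by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return 100.0 * (1 - levenshtein(a, b) / longest)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(round(levenshtein_percent("kitten", "sitting"), 1))
```

The running time is proportional to the product of the two string lengths, which is one reason it gets expensive on long posts.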
You could use tf-idf, but it will probably work better if your corpus contains more than two documents.
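A hand-rolled sketch of tf-idf weighting, written with the standard library only so the arithmetic is visible. It assumes the plain variant (raw term count times `log(N / document frequency)`, no smoothing); real libraries usually use smoothed variants, and the function names and sample posts here are made up.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(corpus: list[str]) -> list[dict[str, float]]:
    """Weight each term by its count in the post times log(N / document frequency)."""
    tokenized = [tokenize(doc) for doc in corpus]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    # Hypothetical corpus: the first two posts are near-duplicates.
    posts = [
        "the server crashed again last night",
        "the server crashed again this morning",
        "i love hiking in the mountains",
        "the new phone has a great camera",
    ]
    vectors = tfidf_vectors(posts)
    print(cosine(vectors[0], vectors[1]))  # shared rare terms give a positive score
    print(cosine(vectors[0], vectors[2]))  # only "the" is shared, and its weight is 0
```

With this unsmoothed formula and a corpus of exactly two posts, every shared term appears in both documents and gets weight `log(2/2) = 0`, so the score collapses to zero. That illustrates why a larger corpus helps.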
An alternative could be representing the documents (posts) using a vector space model, like:
(w_0, w_1, ..., w_k)
where k is the total number of terms (words) in your document and w_i is the i-th term, and then computing the Hamming distance, which compares two vectors (arrays) and counts the positions where they differ. You can discard stop words first (i.e. words like prepositions, etc.).
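One possible reading of this suggestion, sketched below: build a shared vocabulary for the two posts, represent each post as a 0/1 vector marking which terms it contains, and count the positions that differ. The binary encoding, the tiny `STOP_WORDS` list and the helper names are assumptions made for this example, not something the answer specifies.

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "on", "to", "is", "it"})


def terms(text: str) -> set[str]:
    """Distinct lowercase word tokens with stop words removed."""
    return {t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOP_WORDS}


def presence_vectors(a: str, b: str) -> tuple[list[int], list[int]]:
    """Encode both posts as 0/1 vectors over their shared, sorted vocabulary."""
    terms_a, terms_b = terms(a), terms(b)
    vocabulary = sorted(terms_a | terms_b)
    return ([int(t in terms_a) for t in vocabulary],
            [int(t in terms_b) for t in vocabulary])


def hamming_distance(u: list[int], v: list[int]) -> int:
    """Number of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs vectors of equal length")
    return sum(x != y for x, y in zip(u, v))


if __name__ == "__main__":
    u, v = presence_vectors("The cat sat on the mat", "A cat sat on a rug")
    # Vocabulary is [cat, mat, rug, sat]; the posts differ on "mat" and "rug".
    print(hamming_distance(u, v))  # 2
```

Dividing the distance by the vocabulary size and subtracting from 1 would give a rough similarity ratio.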
Keep in mind that the user might change some words, use synonyms, etc. There are lots of models for representing documents and computing the similarity between them. Some of them take word dependencies into account, which captures more of the semantics, and others don't.