
Algorithm to identify similarity between text messages

I'm looking for an algorithm that can compare two text messages (let's say forum posts) and report their similarity as a percentage.

What would be the most efficient solution for this purpose?

The idea is to use this algorithm to identify users on a forum who register more than one nickname and pretend to be different people.

I'm going to build a program that reads all their posts and compares each post from the first account against the posts of the second account, to determine whether they are genuinely two different people or just two registrations of a single user.
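The pairwise comparison described above could be sketched like this, using Python's standard-library `difflib` as a stand-in similarity measure (the function names and the choice of `SequenceMatcher` are illustrative, not a recommendation for the final algorithm):

```python
from difflib import SequenceMatcher

def similarity_pct(a: str, b: str) -> float:
    """Rough similarity between two posts, as a percentage (0-100)."""
    return SequenceMatcher(None, a, b).ratio() * 100

def max_cross_similarity(posts_a, posts_b) -> float:
    """Compare every post of account A against every post of account B
    and return the highest similarity found."""
    return max(similarity_pct(p, q) for p in posts_a for q in posts_b)
```

A high maximum (or average) cross-similarity would then flag the account pair for manual review rather than serve as proof on its own.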

asked Feb 28 '14 23:02 by SharpAffair


1 Answer

The first thing that came to my mind was the Levenshtein distance, but it is more suited to measuring similarity between individual words or short strings than between whole documents.
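For reference, the Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A minimal dynamic-programming sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]                          # turning a[:i] into "" costs i deletions
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute ca -> cb
        prev = curr
    return prev[len(b)]
```

For example, `levenshtein("kitten", "sitting")` is 3, which illustrates why it works well for single words but scales poorly (and loses meaning) on long, reordered posts.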

You could use tf-idf, but it will probably work better if your corpus contains more than just two documents.
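A minimal pure-Python sketch of tf-idf combined with cosine similarity, assuming a small pre-tokenized corpus (the `+ 1` idf smoothing is one common variant, not the only choice; libraries such as scikit-learn implement this more robustly):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) + 1 for t in vocab}        # smoothed idf
    return [[Counter(doc)[t] * idf[t] for t in vocab] for doc in docs]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

With a larger corpus (e.g., all posts on the forum), rare terms get higher idf weight, so two accounts sharing unusual vocabulary score as more similar than two accounts sharing only common words.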

An alternative could be representing the documents (posts) using a vector space model, like:

(w_0, w_1, ..., w_k)

where

  • k is the total number of terms (words) in your document
  • w_i is the i-th term.

and then compute the Hamming distance, which basically compares two vectors (arrays) and counts the positions where they differ. You can discard stop words first (e.g., prepositions, articles, and so on).
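The Hamming-distance idea above could be sketched as follows; since Hamming distance is only defined for equal-length vectors, the shorter term vector is padded here, and the stop-word list is a tiny illustrative placeholder, not a real one:

```python
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is"}  # illustrative only

def term_vector(post: str):
    """Tokenize a post and discard stop words, keeping term order."""
    return [w for w in post.lower().split() if w not in STOP_WORDS]

def hamming(u, v) -> int:
    """Count positions where the two term vectors differ;
    the shorter vector is padded with None so lengths match."""
    n = max(len(u), len(v))
    u = u + [None] * (n - len(u))
    v = v + [None] * (n - len(v))
    return sum(1 for a, b in zip(u, v) if a != b)
```

For example, `term_vector("the cat sat on the mat")` and `term_vector("the cat sat on a hat")` differ only in their last term, giving a Hamming distance of 1. Note this measure is position-sensitive: reordering the same words inflates the distance, which is one reason bag-of-words measures like cosine similarity are often preferred.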

Take into account that the user might change some words, use synonyms, and so on. There are many models for representing documents and computing similarity between them. Some take word dependencies into account, which adds semantics to the process; others don't.

answered Sep 21 '22 05:09 by Oscar Mederos