
Algorithm to compare similarity of ideas (as strings)

Consider an arbitrary text box that records answers to the question, "What do you want to do before you die?"

Using a collection of response strings (max length 240), I'd like to sort, group, and count them by idea (which may come down to plain string similarity, as described in this question).

  1. Is there another or better way to do something like this?
  2. Is this any different than string similarity?
  3. Is this the right question to be asking?

The idea is to have many people write in the text box, and for me to report a number that says, roughly, "802 people wrote approximately the same thing."

Kristian asked Apr 02 '12 21:04



2 Answers

It is much more difficult than string similarity. This is what you need to do at a minimum:

  • Perform some text formatting/cleaning tasks, such as removing punctuation characters and common "stop words".
  • Construct a corpus (a collection of words with their usage statistics) from the terms that occur in the answers.
  • Calculate a weight for every term.
  • Construct a document vector from every answer (each term corresponds to a dimension in a very high-dimensional Euclidean space).
  • Run a clustering algorithm on document vectors.
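
The steps above can be sketched in plain Python. This is only a minimal illustration under several assumptions: the stop-word list is a tiny made-up sample, the term weight is a smoothed TF-IDF, and the "clustering algorithm" is a simple greedy single-pass grouping by cosine similarity with an arbitrary threshold of 0.3. The function names (`tokenize`, `tfidf_vectors`, `cluster`) are hypothetical, and a real system would more likely use a library implementation.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "to", "i", "my", "in", "of", "and",
              "want", "would", "like"}


def tokenize(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse vector (dict: term -> weight) per answer."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many answers each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf, so very common terms keep a small positive weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: each answer joins the first cluster
    whose first member is similar enough, otherwise it starts a new one."""
    vectors = tfidf_vectors(docs)
    clusters = []  # each cluster is a list of indices into docs
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


answers = [
    "I want to travel the world",
    "Travel around the world!",
    "Learn to play the piano",
    "I would like to play piano",
    "Write a book",
]
groups = cluster(answers)
# Report how many answers fall into each group, largest first.
for members in sorted(groups, key=len, reverse=True):
    print(len(members), "answer(s) like:", answers[members[0]])
```

The count per group is the kind of number the question asks for. Note that this only groups answers that share words; "see the globe" and "travel the world" would still land in different groups.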

Read a good statistical natural language processing book, or search Google for good introductions and tutorials (likely terms: statistical NLP, text categorization, clustering). You can probably find libraries for the language of your choice (Weka or NLTK come to mind), but you need to understand the concepts to use such a library anyway.

Ali Ferhat answered Sep 23 '22 15:09


Latent Semantic Analysis (LSA) might interest you. Here is a nice introduction.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. [...]
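
To make the idea concrete, here is a rough pure-Python sketch of LSA on a toy corpus. It is an illustration under assumptions, not a reference implementation: the answers are assumed to be already cleaned (lowercase, no stop words), raw term counts are used as weights, and a basic power iteration stands in for the truncated SVD that a numerical library would normally provide. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def term_document_counts(docs):
    """Term-document matrix: rows are terms, columns are documents."""
    vocab = sorted({w for d in docs for w in d.split()})
    counts = [Counter(d.split()) for d in docs]
    return [[c[t] for c in counts] for t in vocab]


def top_eigenpairs(matrix, k, iterations=300, seed=0):
    """Top-k eigenpairs of a symmetric positive semidefinite matrix, found by
    power iteration with re-orthogonalization against earlier eigenvectors.
    Assumes the matrix has at least k nonzero eigenvalues."""
    rng = random.Random(seed)
    n = len(matrix)
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]  # fresh start vector
        for _ in range(iterations):
            # Remove components along eigenvectors already found.
            for _, u in pairs:
                proj = sum(a * b for a, b in zip(v, u))
                v = [a - proj * b for a, b in zip(v, u)]
            w = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        mv = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
        pairs.append((sum(a * b for a, b in zip(v, mv)), v))
    return pairs


def lsa_document_vectors(docs, k=2):
    """Coordinates of each document in a k-dimensional concept space."""
    a = term_document_counts(docs)
    n = len(docs)
    # Gram matrix A^T A: entry (i, j) is the dot product of documents i and j.
    gram = [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]
    pairs = top_eigenpairs(gram, k)
    # Eigenvalues of A^T A are squared singular values of A, so document i
    # has coordinate sigma_c * v_c[i] on concept c.
    return [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs]
            for i in range(n)]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = ["see world", "travel world", "travel globe",
        "learn piano", "play piano", "play music"]
vecs = lsa_document_vectors(docs, k=2)
# "see world" and "travel globe" share no words, yet "travel world" links them.
print("see world vs travel globe:", round(cosine(vecs[0], vecs[2]), 3))
print("see world vs learn piano: ", round(cosine(vecs[0], vecs[3]), 3))
```

With two concepts on this toy corpus, answers that share no words but are linked through a third answer end up close together, while answers from the unrelated topic stay far apart. That is the property that makes LSA more useful than plain string similarity for grouping by idea.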

Franck Dernoncourt answered Sep 21 '22 15:09