
NLP/Machine Learning text comparison [closed]

I'm developing a program that compares a short text (say 250 characters) to a collection of similar texts (around 1000-2000 texts).

The purpose is to evaluate whether text A is similar to one or more texts in the collection and, if so, to retrieve the matching texts by ID. Each text has a unique ID.

There are two ways I'd like the output to be:

Option 1: Text A matched Text B with 90% similarity, Text C with 70% similarity, and so on.

Option 2: Text A matched Text D with highest similarity

I studied some machine learning in school, but I'm not sure which algorithm suits this problem best, or whether I should consider using NLP (I'm not familiar with the subject).

Does anyone have a suggestion for which algorithm to use, or where I can find the necessary literature to solve my problem?

RobertH asked Aug 26 '13 08:08



2 Answers

This does not seem to be a machine learning problem; you are simply looking for a text similarity measure. Once you select one, you just sort your data by the resulting scores.
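As a rough sketch of that "score, then sort" idea, the snippet below ranks a small collection against a query text and prints both output styles from the question. The collection, its IDs, and the helper names are made up for illustration, and `difflib.SequenceMatcher` from the Python standard library is only a stand-in for whichever measure you end up choosing.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score in [0, 1]; SequenceMatcher is only a stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_matches(query, collection, threshold=0.0):
    """Score every text in the collection and sort by descending similarity."""
    scored = [(text_id, similarity(query, text)) for text_id, text in collection.items()]
    kept = [(text_id, score) for text_id, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical collection keyed by unique ID.
collection = {
    "B": "The quick brown fox jumps over the lazy dog",
    "C": "A quick brown dog jumps over the lazy fox",
    "D": "Stock prices fell sharply on Monday",
}
query = "The quick brown fox jumped over a lazy dog"

ranked = rank_matches(query, collection)

# Option 1: every match with its score.
for text_id, score in ranked:
    print(f"Text A matched Text {text_id} with {score:.0%} similarity")

# Option 2: only the best match.
best_id, best_score = ranked[0]
print(f"Text A matched Text {best_id} with highest similarity")
```

With only 1000-2000 short texts, scoring the query against every entry like this should be fast enough; a `threshold` can be used to drop weak matches from the Option 1 output.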

Depending on your texts, you can use one of the following metrics (list from the wiki) or define your own:

  • Hamming distance
  • Levenshtein distance and Damerau–Levenshtein distance
  • Needleman–Wunsch distance or Sellers' algorithm
  • Smith–Waterman distance
  • Gotoh distance or Smith-Waterman-Gotoh distance
  • Monge Elkan distance
  • Block distance or L1 distance or City block distance
  • Jaro–Winkler distance
  • Soundex distance metric
  • Simple matching coefficient (SMC)
  • Dice's coefficient
  • Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
  • Tversky index
  • Overlap coefficient
  • Euclidean distance or L2 distance
  • Cosine similarity
  • Variational distance
  • Hellinger distance or Bhattacharyya distance
  • Information radius (Jensen–Shannon divergence)
  • Skew divergence
  • Confusion probability
  • Tau metric, an approximation of the Kullback–Leibler divergence
  • Fellegi and Sunter metric (SFS)
  • Maximal matches
  • Lee distance
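To give a feel for two entries in the list, here is a minimal sketch of Levenshtein distance (character level) and the Jaccard coefficient (computed here over word sets, which is one choice among several). The function names and the normalisation to a 0-1 similarity are assumptions for illustration, not a reference implementation.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Normalise the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(a, b):
    """Jaccard coefficient over lower-cased word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


print(levenshtein("kitten", "sitting"))       # 3
print(jaccard("the cat sat", "the cat ran"))  # 0.5
```

Edit distances tend to suit near-duplicates and typos, while set-based coefficients ignore word order and are more forgiving of rearranged text.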

Some of the metrics listed above (cosine similarity, for example) require transforming your data into a vectorized format. This can be done in many ways, the simplest being bag-of-words or TF-IDF.

The list itself is far from complete; it is just a draft of such methods. In particular, there are many string kernels that are also suited to measuring text similarity. For example, a WordNet kernel can measure semantic similarity based on one of the most complete semantic databases of the English language.
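A minimal pure-Python sketch of that vectorization step, assuming a naive whitespace tokenizer and one common smoothed IDF variant, might look like the following. The document IDs, sample texts, and function names are hypothetical; a real project would more likely rely on an existing library for this.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


def build_idf(documents):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(tokenize(doc)))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse vector mapping term -> count * idf; terms unseen in the collection are dropped."""
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical collection keyed by unique ID.
docs = {
    1: "cheap flights to paris and london",
    2: "how to bake sourdough bread",
    3: "london weather forecast",
}
idf = build_idf(list(docs.values()))
vectors = {doc_id: tfidf_vector(text, idf) for doc_id, text in docs.items()}

query_vec = tfidf_vector("cheap flights to london", idf)
scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in vectors.items()}
best_id = max(scores, key=scores.get)
print(best_id, scores)
```

The document vectors only need to be built once; each new query text is then vectorized with the same `idf` table and compared against them.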

lejlot answered Oct 20 '22 05:10


I heard about three approaches from Dr. Golden:

  • Cosine Angular Separation

  • Hamming Distance

  • Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI)

These methods are based on semantic similarity.

I also heard that a company used a tool called spaCy to summarize documents in order to compare them with each other.

Cloud Cho answered Oct 20 '22 06:10