I'm currently developing a program that compares a short text (say, 250 characters) to a collection of similar texts (around 1,000-2,000 of them).
The purpose is to evaluate whether text A is similar to one or more texts in the collection and, if so, to retrieve the matching text by its ID. Each text will have a unique ID.
There are two ways I'd like the output to be:
Option 1: Text A matched Text B with 90% similarity, Text C with 70% similarity, and so on.
Option 2: Text A matched Text D with highest similarity
I have studied some machine learning in school, but I'm not sure which algorithm suits this problem best, or whether I should consider using NLP (I'm not familiar with the subject).
Does anyone have a suggestion of which algorithm to use, or where I can find the necessary literature to solve my problem?
The IDF is the logarithm of the total number of documents divided by the number of documents that contain the term. For example, if there are 50,000 documents and the word 'stock' appears in 500 of them, the IDF is ln(50000/500) ≈ 4.6. Assuming a term frequency (TF) of 0.01 for 'stock' (e.g. it appears once in a 100-word text), its TF-IDF is 4.6 * 0.01 = 0.046.
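A quick sketch of that arithmetic in Python (the TF of 0.01 is the figure from the example above, not something computed here):

```python
import math

total_docs = 50000      # documents in the collection
docs_with_term = 500    # documents containing 'stock'
tf = 0.01               # term frequency, taken from the example

idf = math.log(total_docs / docs_with_term)  # natural log of 100 ≈ 4.6
tfidf = tf * idf                             # ≈ 0.046
print(f"IDF = {idf:.1f}, TF-IDF = {tfidf:.3f}")
```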
In NLP, semantic matching techniques aim to compare two sentences to determine if they have similar meaning. Note: A sentence can be a phrase, a paragraph or any distinct chunk of text. This is especially important in search.
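For illustration only (the library is my own pick, not something mentioned above), semantic matching is commonly done with sentence embeddings, e.g. via the sentence-transformers package:

```python
from sentence_transformers import SentenceTransformer, util

# A high cosine score indicates similar meaning even when the two
# sentences share few or no words.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["How do I reset my password?",
                           "I forgot my login credentials."])
print(util.cos_sim(embeddings[0], embeddings[1]))
```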
NLP and text mining differ in their goals. NLP is used to understand human language by analyzing text, speech, or grammatical syntax. Text mining is used to extract information from unstructured and structured content; it focuses on the structure rather than the meaning of the content.
This does not seem to be a machine learning problem; you are simply looking for a text similarity measure. Once you select one, you just sort your data according to the achieved "scores".
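That score-then-sort step already covers both output options from the question. A minimal sketch, where `similarity` is a placeholder for whichever measure you end up choosing:

```python
def rank_matches(text_a, collection, similarity):
    # `collection` maps each unique ID to its text; `similarity`
    # returns a score in [0, 1] for a pair of texts.
    scores = {doc_id: similarity(text_a, doc)
              for doc_id, doc in collection.items()}
    # Option 1: every text with its score, best match first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Option 2: only the single best match
# best_id, best_score = rank_matches(text_a, collection, similarity)[0]
```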
Depending on your texts, you can use one of many established metrics (Wikipedia maintains a list of string similarity measures) or define your own.
Some of these metrics (e.g. cosine similarity) require transforming your data into a vectorized format. This can be done in many ways, the simplest being bag-of-words or TF-IDF representations.
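A minimal sketch of that pipeline with scikit-learn; the IDs and texts are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ids = ["B", "C", "D"]  # your unique text IDs (placeholders here)
corpus = ["text B goes here", "text C goes here", "text D goes here"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)            # one row per text
query_vector = vectorizer.transform(["text A goes here"])

# Cosine similarity between text A and every text in the collection
scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc_id, score in sorted(zip(ids, scores), key=lambda p: p[1], reverse=True):
    print(f"Text A matched Text {doc_id} with {score:.0%} similarity")
```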
The list itself is far from complete; it is just a sketch of such methods. In particular, there are many string kernels, which are also suited to measuring text similarity. The WordNet kernel, for example, can measure semantic similarity based on one of the most complete semantic databases of the English language.
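Not the kernel itself, but a small sketch of WordNet-based similarity between individual words, using NLTK (assumes `nltk.download('wordnet')` has been run):

```python
from nltk.corpus import wordnet as wn

def word_similarity(w1, w2):
    # Path similarity between the closest pair of senses, in (0, 1];
    # path_similarity() returns None when two senses are unconnected.
    synsets1, synsets2 = wn.synsets(w1), wn.synsets(w2)
    if not synsets1 or not synsets2:
        return 0.0
    return max((s1.path_similarity(s2) or 0.0)
               for s1 in synsets1 for s2 in synsets2)

print(word_similarity("stock", "share"))
```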
I heard there are three approaches from Dr. Golden:
Cosine Angular Separation
Hamming Distance
Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI)
These methods capture similarity at different levels, from character-level overlap (Hamming distance) to latent semantics (LSA/LSI); a minimal LSA sketch follows below.
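A minimal LSA sketch (an assumed pipeline, not necessarily Dr. Golden's): TF-IDF vectors reduced with truncated SVD, then compared with cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the stock market fell",
          "shares dropped on the exchange",
          "I had pancakes for breakfast"]

tfidf = TfidfVectorizer().fit_transform(corpus)
# Project the TF-IDF space down to a small number of latent "topics"
lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)

print(cosine_similarity(lsa[:1], lsa)[0])  # first text vs. all texts
```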
I also heard that some companies use a tool called spaCy to summarize documents and then compare them to each other.
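spaCy's built-in similarity can handle the comparison step. A minimal sketch, assuming the en_core_web_md model is installed (`python -m spacy download en_core_web_md`):

```python
import spacy

# doc.similarity() is cosine similarity over averaged word vectors,
# so a model that ships with vectors is required (sm models will warn).
nlp = spacy.load("en_core_web_md")
doc_a = nlp("The stock market fell sharply today.")
doc_b = nlp("Shares dropped steeply on the exchange.")
print(doc_a.similarity(doc_b))
```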