How can I tweak Levenshtein distance to classify linguistically similar words (e.g. verb tenses, adjective comparisons, singular and plural)?

I am out of ideas on how to complete this task. I am counting the frequency of each word, or more precisely of its base form (e.g. "running" is counted as "run"). I looked up some implementations of Levenshtein distance (one implementation I ran into is from dotnetperls).

I also tried Double Metaphone, but it isn't what I'm looking for.

So, please give me some ideas on how to tweak the Levenshtein distance algorithm to classify linguistically similar words. As it stands, the algorithm only determines the number of edits needed to turn one word into another; it does not consider whether the two words are linguistically related.

Example:
1. "running" will be counted as one occurrence of the word "run"
2. "word" will likewise be an occurrence of "word"
3. "fear" will NOT be counted as an occurrence of "gear"

Also, I am implementing it in C#.

Thanks in advance.

Edit: I edited the question as Rene suggested. Another note: I am considering checking whether one word is a substring of another, but that implementation would not be very flexible. Another idea is: "if adding -s or -ing to string1 makes string1 == string2, then string2 is an occurrence of string1." However, this does not always hold, as some words have irregular plurals.

asked Jan 07 '12 10:01

Jinnean

1 Answer

The task you are trying to solve is called stemming or lemmatisation.

As you have already figured out, Levenshtein distance is not the way to go here. Common stemming algorithms for English include the Porter and Snowball stemmers. If you search for them, I'm sure you will find a C# implementation of one of them.
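To make the point concrete, here is a minimal sketch (in Python for brevity; the logic ports directly to C#). It first shows that edit distance ranks "fear"/"gear" as closer than "running"/"run", then counts base forms with a deliberately crude suffix stripper. Note that `naive_stem`, `count_stems` and the `IRREGULAR` table are hypothetical names made up for this illustration, and the rules are NOT the Porter or Snowball algorithm; a real stemmer handles far more cases.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical lookup table for irregular forms (assumption: filled in by hand).
IRREGULAR = {"ran": "run", "mice": "mouse"}


def naive_stem(word: str) -> str:
    """Crude suffix stripping for illustration only; not Porter or Snowball."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix in ("ing", "ed", "es", "s"):
        # Require at least three letters to remain, so "bus" stays "bus".
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            stem = w[: -len(suffix)]
            # Undo a doubled final consonant: "runn" -> "run".
            # Doubled vowels and l, s, z are left alone ("see", "fall", "pass").
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return w


def count_stems(words):
    """Count occurrences of each base form."""
    return Counter(naive_stem(w) for w in words)


print(levenshtein("fear", "gear"))    # 1: unrelated words, tiny distance
print(levenshtein("running", "run"))  # 4: related words, larger distance
print(count_stems(["running", "run", "ran", "word", "words", "fear", "gear"]))
```

With this sketch, "running", "run" and "ran" are all counted under "run", "word" and "words" under "word", and "fear" and "gear" stay separate. The crude rules still mangle many words (for example "houses" becomes "hous"), which is exactly why an established stemmer or lemmatiser is the better choice.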

answered Oct 20 '22 20:10

tobigue

Tags: nlp, levenshtein-distance, similarity