How to perform Lemmatization in R?

Tags: r, nlp, lemmatization

This question is a possible duplicate of "Lemmatizer in R or python (am, are, is -> be?)", but I'm asking again because that question was closed as too broad, and its only answer is inefficient: it queries an external website, which is too slow for my very large corpus. Part of this question will therefore overlap with that one.

According to Wikipedia, lemmatization is defined as:

Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

A simple Google search for lemmatization in R points only to the R package wordnet. I tried it expecting that passing the character vector c("run", "ran", "running") to a lemmatization function would return c("run", "run", "run"), but the package only provides functionality similar to grepl(), through various filter names and a dictionary.

Here is example code from the wordnet package, which returns at most 5 words starting with "car", as the filter name suggests:

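library(wordnet)

# Build a prefix filter for "car" (TRUE = ignore case), then fetch up to 5 matching nouns
filter <- getTermFilter("StartsWithFilter", "car", TRUE)
terms <- getIndexTerms("NOUN", 5, filter)
sapply(terms, getLemma)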

The above is NOT the lemmatization I'm looking for. Using R, I want to find the true roots of words, e.g. going from c("run", "ran", "running") to c("run", "run", "run").

asked Jan 29 '15 11:01

StrikeR

4 Answers

You can try the koRpus package, which lets you use TreeTagger:

tagged.results <- treetag(c("run", "ran", "running"), treetagger="manual", format="obj",
                      TT.tknz=FALSE , lang="en",
                      TT.options=list(path="./TreeTagger", preset="en"))
tagged.results@TT.res

##     token tag lemma lttr wclass                               desc stop stem
## 1     run  NN   run    3   noun             Noun, singular or mass   NA   NA
## 2     ran VVD   run    3   verb                   Verb, past tense   NA   NA
## 3 running VVG   run    7   verb Verb, gerund or present participle   NA   NA

See the lemma column for the result you're asking for.

answered Oct 17 '22 18:10

Victorp

As a previous post mentioned, the function lemmatize_words() from the R package textstem can perform this and give you what I understand as your desired results:

library(textstem)
vector <- c("run", "ran", "running")
lemmatize_words(vector)

## [1] "run" "run" "run"
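As far as I know, lemmatize_words() is by default a lookup against a token-to-lemma dictionary. The base R sketch below illustrates that idea only; the table `lemma_dict` and the function `lookup_lemmas()` are made-up names for this example and are not part of textstem.

```r
# Toy token-to-lemma table; the entries are illustrative, not taken from any package.
lemma_dict <- data.frame(
  token = c("ran", "running", "am", "are", "is"),
  lemma = c("run", "run", "be", "be", "be"),
  stringsAsFactors = FALSE
)

# Replace each word by its lemma when the table has an entry; otherwise keep the word.
lookup_lemmas <- function(words, dict = lemma_dict) {
  idx <- match(tolower(words), dict$token)
  ifelse(is.na(idx), words, dict$lemma[idx])
}

lookup_lemmas(c("run", "ran", "running"))
## [1] "run" "run" "run"
```

A real dictionary has many thousands of entries, which is why a packaged one is preferable to a hand-written table like this.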

answered Oct 17 '22 16:10

Andy

@Andy and @Arunkumar are correct that the textstem library can be used for stemming and/or lemmatization. However, lemmatize_words() only works on a vector of words. A corpus is not a vector of words; it is a set of strings, each holding one document's content. To lemmatize a corpus, pass lemmatize_strings() as the function argument to tm_map() from the tm package.

> corpus[[1]]
[1] " earnest roughshod document serves workable primer regions recent history make 
terrific th-grade learning tool samuel beckett applied iranian voting process bard 
black comedy willie loved another trumpet blast may new mexican cinema -bornin "
> corpus <- tm_map(corpus, lemmatize_strings)
> corpus[[1]]
[1] "earnest roughshod document serve workable primer region recent history make 
terrific th - grade learn tool samuel beckett apply iranian vote process bard black 
comedy willie love another trumpet blast may new mexican cinema - bornin"

Do not forget to run the following line of code after you have done lemmatization:

> corpus <- tm_map(corpus, PlainTextDocument)

This is because creating a document-term matrix requires objects of type 'PlainTextDocument', and lemmatize_strings() changes that type. More specifically, the corpus no longer contains each document's content and metadata; it is now just a structure holding the documents' content, which is not the type of object that DocumentTermMatrix() accepts.
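An alternative worth trying is to wrap the function in tm's content_transformer(), which applies it to each document's text and hands back the document object. This is a minimal sketch that assumes a reasonably recent tm version and uses two made-up documents in place of a real corpus; under that assumption the PlainTextDocument conversion may not be needed, but check it against your own tm version.

```r
library(tm)
library(textstem)

# Two made-up documents standing in for a real corpus.
docs <- c("the dogs were running in the parks",
          "she ran while the children played")
corpus <- VCorpus(VectorSource(docs))

# content_transformer() runs lemmatize_strings() on each document's text
# and keeps the document object (content plus metadata) intact.
corpus <- tm_map(corpus, content_transformer(lemmatize_strings))

# Build the document-term matrix from the lemmatized corpus.
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```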

Hope this helps!

answered Oct 17 '22 18:10

Harshit Lamba

Maybe stemming is enough for you? Typical natural language processing tasks make do with stemmed texts. You can find several packages in the CRAN Task View for NLP: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

If you really do require something more complex, there are specialized solutions based on mapping sentences to neural nets. As far as I know, these require a massive amount of training data. The Stanford NLP Group has created and released a lot of open software.

If you really want to dig into the topic, you can dig through the event archives linked in the same Stanford NLP Group publications section. There are some books on the topic as well.
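To see how stemming differs from the lemmatization asked about, here is a toy suffix stripper in base R. It is only a sketch: `naive_stem()` is a made-up helper, not the Porter algorithm and not a function from any CRAN package, and it handles just three suffixes.

```r
# Toy stemmer: strips a few common suffixes, then undoubles a trailing consonant.
naive_stem <- function(words) {
  # Drop "ing", "ed" or "s" only when at least 3 characters remain.
  stems <- sub("^(.{3,})(ing|ed|s)$", "\\1", words, perl = TRUE)
  # Collapse a doubled final consonant left behind, e.g. "runn" -> "run".
  sub("([b-df-hj-np-tv-z])\\1$", "\\1", stems, perl = TRUE)
}

naive_stem(c("run", "ran", "running"))
## [1] "run" "ran" "run"
```

The irregular form "ran" is left unchanged because no suffix rule applies to it, so whether stemming is enough depends on how much such forms matter in your corpus.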

answered Oct 17 '22 17:10

LauriK
