Treat words separated by space in the same manner

Tags: r, text-mining, corpus, tm

I am trying to find the words that occur in several documents at once.

Let us take an example.

doc1: "this is a document about milkyway"
doc2: "milky way is huge"

As you can see, the word "milkyway" occurs in both documents, but in the second document it is split by a space ("milky way") while in the first it is not.

I am doing the following to get the term-document matrix in R.

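library(tm)

doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"

# Recent tm versions expect DataframeSource input to have doc_id and text columns
tmp.text <- data.frame(doc_id = c("1", "2"), text = c(doc1, doc2), stringsAsFactors = FALSE)
tmp.corpus <- Corpus(DataframeSource(tmp.text))
# Despite the variable name, this is a term-document matrix (terms in rows)
tmpDTM <- TermDocumentMatrix(tmp.corpus, control = list(tolower = TRUE, removeNumbers = TRUE, removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df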

         1 2
document 1 0
huge     0 1
milky    0 1
milkyway 1 0
way      0 1

According to the matrix above, the term "milkyway" is present only in the first document.

I want to get 1 in both documents for the term "milkyway" in this matrix. This is just an example; I need to do this for a lot of documents. Ultimately, I want to treat such words ("milkyway" and "milky way") as the same term.

EDIT 1:

Can't I force the term-document matrix to be computed so that each term is matched not only as a separate word but also inside longer strings? For example, take the term "milky" and the document "this is milkyway". Currently "milky" does not occur in this document, but if the algorithm also looked for the term within strings, it would find "milky" inside "milkyway". That way the words "milky" and "way" would be counted in both of my documents from the earlier example.
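A minimal base R sketch of the substring matching described in EDIT 1, assuming the term list has already been extracted. The names `docs`, `terms`, and `substring_tdm` are illustrative and not part of tm. It uses `grepl` with `fixed = TRUE` to test whether each term occurs anywhere inside each document string.

```r
docs <- c(doc1 = "this is a document about milkyway",
          doc2 = "milky way is huge")
terms <- c("document", "huge", "milky", "milkyway", "way")

# For every document, flag each term that occurs anywhere in the string,
# including inside a longer word.
substring_tdm <- sapply(docs, function(d) {
  as.integer(vapply(terms, function(term) grepl(term, d, fixed = TRUE), logical(1)))
})
rownames(substring_tdm) <- terms
substring_tdm
```

With this approach "milky" and "way" are flagged in both documents, but "milkyway" is still found only in the first one, and short terms such as "way" would also match inside unrelated words like "always".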

EDIT 2:

Ultimately, I want to calculate the cosine similarity between documents.
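To show why the split matters for that goal, here is a small base R sketch of cosine similarity applied to the counts from the matrix above. The helper `cosine_similarity` and the `merged` matrix are assumptions made for illustration; `merged` is what the counts would look like if "milky way" were already folded into "milkyway".

```r
# Cosine similarity between two term-count vectors of equal length.
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Counts copied from the term-document matrix in the question.
original <- matrix(c(1, 0, 0, 1, 0,
                     0, 1, 1, 0, 1),
                   ncol = 2,
                   dimnames = list(c("document", "huge", "milky", "milkyway", "way"),
                                   c("1", "2")))

# Hypothetical counts after "milky way" has been merged into "milkyway".
merged <- matrix(c(1, 0, 1,
                   0, 1, 1),
                 ncol = 2,
                 dimnames = list(c("document", "huge", "milkyway"), c("1", "2")))

sim_original <- cosine_similarity(original[, "1"], original[, "2"])
sim_merged   <- cosine_similarity(merged[, "1"], merged[, "2"])
```

The two documents share no terms in the original matrix, so `sim_original` is 0, while the merged version gives 0.5. A document with no remaining terms would produce a division by zero, which this sketch does not handle.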

asked Oct 13 '15 09:10

user3664020

1 Answer

You will first need to convert the documents to a bag-of-primitive-words representation, where each primitive word stands for a set of words that should be treated as the same term. The primitive word itself can also appear in the corpus.

For instance:

milkyway -> {milky, milky way, milkyway} 
economy -> {economics, economy}
sport -> {soccer, football, basket ball, basket, NFL, NBA}
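As a rough illustration, such a mapping could be applied to the raw text before the matrix is built. The sketch below is base R only; `variants` and `normalize_text` are hypothetical names, and it assumes the variants contain only letters and spaces, so they are safe to use inside a regular expression.

```r
# Hypothetical dictionary: each primitive word maps to its surface variants.
variants <- list(
  milkyway = c("milky", "milky way", "milkyway"),
  economy  = c("economics", "economy")
)

normalize_text <- function(text, dictionary) {
  text <- tolower(text)
  for (primitive in names(dictionary)) {
    forms <- dictionary[[primitive]]
    # Replace longer variants first so "milky way" is handled before "milky".
    forms <- forms[order(nchar(forms), decreasing = TRUE)]
    for (form in forms) {
      # \\b restricts the match to whole words.
      text <- gsub(paste0("\\b", form, "\\b"), primitive, text, perl = TRUE)
    }
  }
  text
}

docs <- c("this is a document about milkyway", "milky way is huge")
normalized <- normalize_text(docs, variants)
normalized
```

After normalization both documents contain the single token "milkyway", so a term-document matrix built from `normalized` would count it in both.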

You can build such a dictionary before computing the cosine distance by combining a synonym dictionary with an edit distance such as Levenshtein, which will fill in the variants the synonym dictionary misses.

Computing the 'sport' key is more involved, since its members are related by meaning rather than by spelling.
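For the edit-distance part, base R's `adist()` (from the utils package) computes a generalized Levenshtein distance. The sketch below is only an illustration: the candidate list and the threshold of 1 are assumptions, and a real corpus would need a threshold tuned to its vocabulary.

```r
key <- "milkyway"
# Hypothetical candidate strings, e.g. terms or adjacent word pairs from the corpus.
candidates <- c("milky way", "milky", "milkway", "huge")

# adist() returns a matrix; flatten it to one distance per candidate.
distances <- as.vector(adist(key, candidates))

# Keep candidates within one edit of the key.
close_variants <- candidates[distances <= 1]
close_variants
```

Here "milky way" and "milkway" are each one edit away from the key and are kept, while "milky" is three edits away and would have to come from the synonym dictionary instead.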

answered Oct 26 '22 10:10

amirouche
