
Treat words separated by space in the same manner

I am trying to find the words occurring in multiple documents at the same time.

Let us take an example.

doc1: "this is a document about milkyway"
doc2: "milky way is huge"

As you can see, the term "milkyway" occurs in both documents, but in the second document it is split by a space ("milky way"), while in the first it is not.

I am doing the following to get the document term matrix in R.

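library(tm)
doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"
# Recent tm versions expect DataframeSource input with doc_id and text columns
tmp.text <- data.frame(doc_id = c("1", "2"), text = c(doc1, doc2), stringsAsFactors = FALSE)
tmp.corpus <- Corpus(DataframeSource(tmp.text))
tmpDTM <- TermDocumentMatrix(tmp.corpus,
                             control = list(tolower = TRUE, removeNumbers = TRUE,
                                            removePunctuation = TRUE, stopwords = TRUE,
                                            wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df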

         1 2
document 1 0
huge     0 1
milky    0 1
milkyway 1 0
way      0 1

According to the above matrix, the term "milkyway" is present only in the first document.

I want to get 1 in both documents for the term "milkyway" in the above matrix. This is just an example; I need to do this for a lot of documents. Ultimately I want to treat such words ("milkyway" and "milky way") as the same term.

EDIT 1:

Can't I force the term-document matrix to be computed so that each term is searched for not only as a separate word but also within other strings? For example, one term is "milky" and one document is "this is milkyway". Currently "milky" does not occur in this document, but if the algorithm also looked for the term within strings, it would find "milky" inside "milkyway". That way the terms "milky" and "way" would be counted in both of my documents from the earlier example.
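To make the idea in this edit concrete, here is a minimal base R sketch of substring matching. It does not use tm; the names docs, terms and substring_tdm are made up for illustration, and the term list is typed in by hand rather than taken from a real matrix.

```r
# Documents from the question
docs <- c(doc1 = "this is a document about milkyway",
          doc2 = "milky way is huge")

# Terms to look for (assumed here; in practice these could be the matrix row names)
terms <- c("milky", "way", "milkyway")

# For each document, flag every term that occurs anywhere in the text,
# including inside longer words (fixed = TRUE turns off regex interpretation)
substring_tdm <- sapply(docs, function(doc) {
  as.integer(sapply(terms, function(term) grepl(term, doc, fixed = TRUE)))
})
rownames(substring_tdm) <- terms
substring_tdm
```

With this kind of matching, "milky" and "way" are flagged in both documents, while "milkyway" is still only found in doc1. Plain substring matching can also over-match (for example "way" inside "always"), so it is probably only useful as a starting point.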

EDIT 2:

Ultimately I want to be able to calculate the cosine similarity between documents.
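For reference, cosine similarity between two documents can be computed directly from the columns of the matrix shown earlier. The sketch below rebuilds that matrix by hand in base R; tdm and cosine_sim are illustrative names, not part of tm.

```r
# Term-document matrix from the question (rows = terms, columns = documents)
tdm <- matrix(c(1, 0,
                0, 1,
                0, 1,
                1, 0,
                0, 1),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("document", "huge", "milky", "milkyway", "way"),
                              c("1", "2")))

# Cosine similarity: dot product divided by the product of the vector norms
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

sim <- cosine_sim(tdm[, "1"], tdm[, "2"])
sim
```

Because the two columns share no term, the similarity here is 0, which is exactly why "milkyway" and "milky way" need to be treated as the same term first.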

asked Oct 13 '15 09:10 by user3664020

1 Answer

You will first need to convert the documents to a bag of primitive words representation, where each primitive word stands for a set of words. The primitive word itself can also appear in the corpus.

For instance:

milkyway -> {milky, milky way, milkyway} 
economy -> {economics, economy}
sport -> {soccer, football, basket ball, basket, NFL, NBA}
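One way to apply such a mapping is to rewrite every variant to its primitive word before building the term-document matrix. The following is a rough base R sketch; variant_map and normalize_doc are hypothetical names, the map only covers a few of the entries above, and the variants are assumed to contain no regex metacharacters.

```r
# Hypothetical variant dictionary: names are variants, values are primitive words
variant_map <- c("milky way" = "milkyway",
                 "milky"     = "milkyway",
                 "economics" = "economy")

normalize_doc <- function(doc, map) {
  doc <- tolower(doc)
  # Replace longer variants first so multi-word variants win over their parts
  for (variant in names(map)[order(-nchar(names(map)))]) {
    pattern <- paste0("\\b", variant, "\\b")  # match whole words only
    doc <- gsub(pattern, map[[variant]], doc, perl = TRUE)
  }
  doc
}

docs <- c("this is a document about milkyway", "milky way is huge")
normalized <- vapply(docs, normalize_doc, character(1),
                     map = variant_map, USE.NAMES = FALSE)
normalized
```

After this step both documents contain the single token "milkyway", so a matrix built from normalized would count the term in both.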

You can build such a dictionary before computing the cosine distance, using both a synonym dictionary and an edit distance such as Levenshtein, which will complement the synonym dictionary.
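As an illustration of the edit-distance part, base R's adist() computes generalized Levenshtein distances between strings. The sketch below flags pairs of terms within a small distance as candidate variants; the term list and the max_dist threshold are assumptions chosen for this example.

```r
# Candidate terms (assumed; e.g. unigrams and bigrams extracted from the corpus)
terms <- c("milkyway", "milky way", "milky", "economy", "economics")

# Pairwise generalized Levenshtein distances (adist() is in the utils package)
d <- adist(terms)
dimnames(d) <- list(terms, terms)

# Hypothetical threshold: treat terms within distance 1 as spelling variants
max_dist <- 1
pairs <- which(d <= max_dist & upper.tri(d), arr.ind = TRUE)
variant_pairs <- data.frame(term_a = terms[pairs[, "row"]],
                            term_b = terms[pairs[, "col"]],
                            distance = d[pairs])
variant_pairs
```

With this threshold only "milkyway" and "milky way" are paired. Edit distance catches spelling variants like these, but not true synonyms, which still have to come from a synonym dictionary.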

Building the 'sport' entry is more involved.

answered Oct 26 '22 10:10 by amirouche
