I am trying to find the words occurring in multiple documents at the same time.
Let us take an example.
doc1: "this is a document about milkyway"
doc2: "milky way is huge"
As you can see, the word "milkyway" occurs in both documents, but in the second document it is split by a space ("milky way") and in the first it is not.
I am doing the following to get the document term matrix in R.
library(tm)
# The two example documents from above
doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"
tmp.text <- data.frame(rbind(doc1, doc2))
tmp.corpus <- Corpus(DataframeSource(tmp.text))
tmpDTM <- TermDocumentMatrix(tmp.corpus, control = list(tolower = TRUE, removeNumbers = TRUE, removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df
         1 2
document 1 0
huge     0 1
milky    0 1
milkyway 1 0
way      0 1
The term "milkyway" is only present in the first doc as per the above matrix.
I want to be able to get 1 in both the docs for term "milkyway" in the above matrix. This is just an example. I need to do this for a lot of documents. Ultimately I want to be able to treat such words ("milkyway" & "milky way") in a similar manner.
EDIT 1:
Can't I force the term document matrix to be calculated so that, for each term, it looks for that term not only as a separate word but also inside longer strings? For example, one term is "milky" and there is a document "this is milkyway". Currently "milky" does not occur in this document, but if the algorithm also looked for the term within strings, it would find "milky" inside "milkyway". That way the words "milky" and "way" would get counted in both of my documents (earlier example).
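A minimal sketch of the "look within strings" idea from EDIT 1, in base R rather than through tm. The function name `count_within` and the term list are hypothetical, and stripping all whitespace before matching is an assumption made here so that "milky way" and "milkyway" look the same. Substring matching like this can produce false positives (for example "way" inside "always", or matches that span two adjacent words), so treat it as an illustration only.

```r
# Presence/absence of each term anywhere inside each document,
# after lowercasing and removing all whitespace.
count_within <- function(terms, docs) {
  squashed <- gsub("\\s+", "", tolower(docs))
  m <- sapply(squashed, function(d) {
    vapply(terms, function(t) as.integer(grepl(t, d, fixed = TRUE)), integer(1))
  })
  m <- matrix(m, nrow = length(terms), ncol = length(docs))
  dimnames(m) <- list(terms, names(docs))
  m
}

docs <- c("1" = "this is a document about milkyway",
          "2" = "milky way is huge")
terms <- c("milky", "way", "milkyway", "galaxy")

m <- count_within(terms, docs)
print(m)
```

With these two documents, "milky", "way" and "milkyway" are each found in both documents, while "galaxy" is found in neither.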
EDIT 2:
Ultimately I want to be able to calculate the cosine similarity between documents.
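For reference, cosine similarity between documents can be computed directly from a term-document matrix. The sketch below uses base R only; the helper name `cosine_similarity` is hypothetical, and the first matrix is typed in by hand to mirror the output shown earlier rather than produced by tm. The second matrix assumes "milky", "way" and "milkyway" have already been merged into one term.

```r
# Cosine similarity between the columns (documents) of a
# term-document matrix with terms in rows.
cosine_similarity <- function(tdm) {
  norms <- sqrt(colSums(tdm^2))
  # A document with no terms has norm 0 and yields NaN here.
  crossprod(tdm) / outer(norms, norms)
}

# Matrix as in the question: no shared terms between the documents.
tdm <- matrix(c(1, 0, 0, 1, 0,
                0, 1, 1, 0, 1),
              ncol = 2,
              dimnames = list(c("document", "huge", "milky", "milkyway", "way"),
                              c("1", "2")))
sim_before <- cosine_similarity(tdm)

# Hypothetical matrix after treating "milky way" as "milkyway".
tdm_merged <- matrix(c(1, 0, 1,
                       0, 1, 1),
                     ncol = 2,
                     dimnames = list(c("document", "huge", "milkyway"),
                                     c("1", "2")))
sim_after <- cosine_similarity(tdm_merged)

print(sim_before)
print(sim_after)
```

Before merging, the two documents share no terms and their similarity is 0; after merging they share "milkyway" and the similarity becomes 0.5.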
You will first need to convert the documents to a bag-of-primitive-words representation, where each primitive word is matched with a set of words. The primitive word itself can also appear in the corpus.
For instance:
milkyway -> {milky, milky way, milkyway}
economy -> {economics, economy}
sport -> {soccer, football, basket ball, basket, NFL, NBA}
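You can build such a dictionary before computing the cosine distance, using both a synonym dictionary and an edit distance such as Levenshtein, which will complement the synonym dictionary.
Building the 'sport' entry is more involved, because its members are related terms rather than spelling variants, so an edit distance alone will not find them.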
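The dictionary idea above could be sketched in base R as follows. Everything named here (`primitive_dict`, `normalize_doc`, `close_variants`) is hypothetical and only illustrates the approach: variants are rewritten to their primitive word before the term-document matrix is built, and `adist` (generalized Levenshtein distance, from the utils package) suggests spelling variants that a synonym list might miss. The sketch assumes the variants contain no regular-expression metacharacters.

```r
# Hand-built mapping: primitive word -> variants to collapse into it.
primitive_dict <- list(
  milkyway = c("milky", "milky way", "milkyway"),
  economy  = c("economics", "economy")
)

# Replace every variant in a document with its primitive word.
normalize_doc <- function(doc, dict) {
  doc <- tolower(doc)
  for (primitive in names(dict)) {
    variants <- dict[[primitive]]
    # Longest variants first, so "milky way" wins over "milky".
    variants <- variants[order(nchar(variants), decreasing = TRUE)]
    pattern <- paste0("\\b(", paste(variants, collapse = "|"), ")\\b")
    doc <- gsub(pattern, primitive, doc, perl = TRUE)
  }
  doc
}

# Candidates within a small edit distance of a primitive word.
close_variants <- function(primitive, candidates, max_dist = 1) {
  d <- adist(primitive, candidates)[1, ]
  candidates[d <= max_dist]
}

normalized <- vapply(c("this is a document about milkyway",
                       "milky way is huge"),
                     normalize_doc, character(1), dict = primitive_dict,
                     USE.NAMES = FALSE)
print(normalized)

suggested <- close_variants("milkyway",
                            c("milky way", "milky-way", "milky", "huge"))
print(suggested)
```

After normalization both documents contain the single token "milkyway", so a term-document matrix built from `normalized` would show 1 for that term in both columns. The edit-distance helper only catches near-identical spellings; an entry like 'sport' would still need a synonym source.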