I have a large document corpus with more than 200 documents. As you would expect from a corpus of this size, some words are misspelled, written in different forms, and so on. I have done the standard text processing: converting to lower case, removing punctuation, and stemming. I am now trying to substitute some words to correct spelling and standardize them before moving on to the analysis. I have done more than 100 substitutions using the same syntax as below, and most of them work as expected. However, some (about 5%) do not. For example, the following substitutions seem to have only a limited effect:
docs <- tm_map(docs, content_transformer(gsub), pattern = "medecin|medicil|medicin|medicinee", replacement = "medicine")
docs <- tm_map(docs, content_transformer(gsub), pattern = "eephant|eleph|elephabnt|elleph|elephanyt|elephantant|elephantant", replacement = "elephant")
docs <- tm_map(docs, content_transformer(gsub), pattern = "firehood|firewod|firewoo|firewoodloc|firewoog|firewoodd|firewoodd", replacement = "firewood")
By limited effect I mean that even though some substitutions are working, some are not. For example, despite trying to replace "elephantant", "medicinee", "firewoodd", they still exist when I create the DTM (document term matrix).
I have no idea why this mixed effect is happening.
Also the following line is replacing every word in the corpus with some combination of collect:
docs <- tm_map(docs, content_transformer(gsub), pattern = "colect|colleci|collectin|collectiong|collectng|colllect|", replacement = "collect")
Just for reference, when I substitute just a single word, I am using the syntax (notice the fixed=TRUE):
docs <- tm_map(docs, content_transformer(gsub), pattern = "charcola", replacement = "charcoal", fixed=TRUE)
The one that is a single substitution and failing is:
docs <- tm_map(docs, content_transformer(gsub), pattern = "dogmonkeycat", replacement = "dog monkey cat", fixed=TRUE)
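The issue is that the alternatives in your patterns are not anchored, so they can also match inside longer words. At any given position only one alternative "wins", and whatever follows the match is left untouched. For example, "medicin" matches the first seven letters of the correctly spelled "medicine"; that match is replaced with "medicine" and the trailing "e" is kept, which produces "medicinee". "eleph" inside "elephant" and "firewoo" inside "firewood" produce "elephantant" and "firewoodd" in the same way.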
You should either use some "anchors" (say, word boundaries) around the alternations:
pattern = "\\b(medecin|medicil|medicin|medicinee)\\b"
or just put the longer alternatives before shorter ones:
pattern = "medicinee|medecin|medicil|medicin"
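To see the difference on a small scale, here is a minimal sketch in base R. The sample strings are made up for illustration, and perl = TRUE is an assumption added so that alternatives are tried strictly left to right; the same arguments can be passed through tm_map(docs, content_transformer(gsub), ...) as in the question.

```r
# Made-up sample words, not the real corpus.
words <- c("medicine", "medicinee", "medecin", "medicil")

# Unanchored: "medicin" matches inside the correct word "medicine",
# the trailing "e" is kept, and "medicinee" is created.
unanchored <- "medecin|medicil|medicin|medicinee"
broken <- gsub(unanchored, "medicine", "medicine", perl = TRUE)
broken
# [1] "medicinee"

# Word boundaries force each alternative to match a whole word.
anchored <- "\\b(medecin|medicil|medicin|medicinee)\\b"
fixed_words <- gsub(anchored, "medicine", words, perl = TRUE)
fixed_words
# [1] "medicine" "medicine" "medicine" "medicine"

# Ordering alone: longer-first repairs "medicinee" ...
short_first <- gsub("medicin|medicinee", "medicine", "medicinee", perl = TRUE)
long_first  <- gsub("medicinee|medicin", "medicine", "medicinee", perl = TRUE)
short_first
# [1] "medicineee"
long_first
# [1] "medicine"

# ... but without boundaries it still rewrites the correct word.
still_broken <- gsub("medicinee|medicin", "medicine", "medicine", perl = TRUE)
still_broken
# [1] "medicinee"
```

Based on this, the word-boundary version looks like the safer of the two options, since reordering the alternatives does not stop a short alternative from matching inside a correctly spelled word.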
Note that you can make the pattern more compact by using character classes for commonly mistyped vowels (see [ie]) and groups:
pattern = "med[ie]ci(?:n(?:ee)?|l)"
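A quick way to check the compact pattern is to run it over the misspellings from the question plus the correct word. This is only a sketch: the \\b anchors and perl = TRUE are additions for the test (perl = TRUE is assumed here so that the (?:...) groups are certain to be supported), and the sample vectors are made up. The second half looks at the "collect" pattern from the question, where the trailing "|" appears to be the culprit: it adds an empty alternative, and an empty alternative matches at every position in every word.

```r
# Compact pattern from above, wrapped in word boundaries.
# Note that it also accepts "medecinee", which was not in the original list.
compact <- "\\bmed[ie]ci(?:n(?:ee)?|l)\\b"
samples <- c("medecin", "medicil", "medicin", "medicinee", "medicine")
compact_result <- gsub(compact, "medicine", samples, perl = TRUE)
compact_result
# [1] "medicine" "medicine" "medicine" "medicine" "medicine"

# The "collect" pattern from the question ends in "|", i.e. an empty alternative.
bad <- "colect|colleci|collectin|collectiong|collectng|colllect|"
bad_hits_unrelated_word <- grepl(bad, "dog", perl = TRUE)
bad_hits_unrelated_word
# [1] TRUE

# Same alternatives without the trailing "|", anchored to whole words.
good <- "\\b(colect|colleci|collectin|collectiong|collectng|colllect)\\b"
good_hits_unrelated_word <- grepl(good, "dog", perl = TRUE)
good_hits_unrelated_word
# [1] FALSE

collect_result <- gsub(good, "collect",
                       c("colect", "collectiong", "collect"), perl = TRUE)
collect_result
# [1] "collect" "collect" "collect"
```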