I am new to R and need suggestions. I have a data frame with one text field in it, and I need to fix the misspelled words in that field. To help with that, I have a second file (a dictionary) with two columns: the misspelled words and the correct words to replace them.
How would you recommend doing it? I wrote a simple "for" loop, but performance is an issue. The file has ~120K rows, the dictionary has ~5K rows, and the program has been running for hours. Each text value can be up to 2000 characters long.
Here is the code:
output <- source_file$MEMO_MANUAL_TXT
for (i in seq_len(nrow(fix_file))) {  # dictionary file: one full pass over the text per row
  # pad with spaces so that only whole words are matched
  target <- paste0(" ", fix_file$change_to_target[i], " ")
  replace <- paste0(" ", fix_file$target[i], " ")
  output <- gsub(target, replace, output, fixed = TRUE)
}
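One possible way to avoid thousands of full passes over the text is to split each string into words once and look every word up in the dictionary. This is only a sketch under some assumptions: every misspelling is a single word, and words are separated by single spaces. The names `texts`, `fixes`, `wrong`, `right` and `fix_text` are made up for illustration and stand in for the real data.

```r
# Hypothetical toy data standing in for source_file and fix_file.
texts <- c("teh cat sat", "a quick borwn fox", "nothing to fix")
fixes <- data.frame(
  wrong = c("teh", "borwn"),
  right = c("the", "brown"),
  stringsAsFactors = FALSE
)

# Named vector used as a lookup table:
# names are the misspellings, values are the corrections.
lookup <- setNames(fixes$right, fixes$wrong)

fix_text <- function(x, lookup) {
  # Split every string into words once, instead of once per dictionary row.
  words <- strsplit(x, " ", fixed = TRUE)
  vapply(words, function(w) {
    hit <- match(w, names(lookup))          # NA where a word is not in the dictionary
    found <- !is.na(hit)
    w[found] <- unname(lookup[hit[found]])  # swap in the corrections
    paste(w, collapse = " ")
  }, character(1))
}

output <- fix_text(texts, lookup)
print(output)
```

Unlike the loop, this also corrects the first and last word of each string, it drops a trailing space if there is one, and it turns a missing value into the string "NA", so check the result against a sample before relying on it.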
I would try agrep. I'm not sure how well it scales though.
For example:
> agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2, value = TRUE)
[1] "1 lazy"
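The same call can be pointed at a list of correctly spelled words to get suggestions for a single misspelling. A minimal sketch, assuming such a list exists: `correct_words` and the helper `suggest` are hypothetical names, not part of the original data.

```r
# Hypothetical list of correctly spelled words.
correct_words <- c("lazy", "quick", "brown")

# Return the candidates that approximately match `word`,
# allowing at most `max_dist` edits.
suggest <- function(word, candidates, max_dist = 2) {
  agrep(word, candidates, max.distance = max_dist, value = TRUE)
}

suggestions <- suggest("laysy", correct_words)
print(suggestions)
```

Note that agrep looks for an approximate match anywhere inside each candidate, so short words with a generous distance can match far more than intended.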
Also check out pmatch and charmatch, although I feel they won't be as useful to you: both do partial matching on the start of a string rather than fuzzy matching.
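A quick illustration of how pmatch and charmatch behave, as a sketch with made-up inputs (the vector `words` is only for demonstration):

```r
words <- c("mean", "median")

unique_hit   <- pmatch("med", words)     # "med" is the start of exactly one element
ambiguous_p  <- pmatch("me", words)      # "me" starts both elements, so pmatch gives NA
ambiguous_c  <- charmatch("me", words)   # charmatch signals an ambiguous match with 0
misspelled   <- pmatch("laysy", "lazy")  # a misspelling is not a prefix, so no match

print(c(unique_hit, ambiguous_p, ambiguous_c, misspelled))
```

Since a typical misspelling is not a prefix of the correct word, neither function finds it, which is why agrep looks like the better fit here.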