I am in dire need. I have a corpus that I have converted into a common language, but some of the words were not properly converted into English. Therefore, my corpus has non-ASCII characters such as U+00F8 (ø).
I am using Quanteda and I have imported my text using this code:
EUCorpus <- corpus(textfile(file="/Users/RiohBurke/Documents/RStudio/PROJECT/*.txt"), encodingFrom = "UTF-8-BOM")
My corpus consists of 166 documents. Having imported the documents into R, what would be the best way to get rid of these non-ASCII characters?
Try:
texts(EUCorpus) <- iconv(texts(EUCorpus), from = "UTF-8", to = "ASCII", sub = "")
This converts the encoding to ASCII, replacing any characters that cannot be translated (those outside the 0-127 ASCII range) with an empty string, which effectively removes them.
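To see what the call does before running it on all 166 documents, here is a minimal sketch in base R that does not need quanteda. The vector `sample_texts` and its contents are made up for illustration; they stand in for what `texts(EUCorpus)` would return. Exact behaviour of `iconv` can vary somewhat between platforms, so treat this as a sanity check rather than a guarantee.

```r
# Hypothetical stand-in for texts(EUCorpus): one string with U+00F8, one plain ASCII
sample_texts <- c("sm\u00f8rrebr\u00f8d is served", "plain ascii text")

# Same call as above: bytes that cannot be represented in ASCII are replaced by sub = ""
cleaned <- iconv(sample_texts, from = "UTF-8", to = "ASCII", sub = "")

# Check that every remaining character has a code point below 128
all_ascii <- all(unlist(lapply(cleaned, utf8ToInt)) < 128)

print(cleaned)
print(all_ascii)
```

Note that the characters are dropped, not transliterated, so words containing them become shorter. If that matters for your analysis, it may be worth inspecting a few affected documents after the conversion.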