
What is the best way to remove non-ASCII characters from a text corpus when using quanteda in R? [duplicate]


I am in dire need. I have a corpus that I have converted into a common language, but some of the words were not properly converted into English. As a result, my corpus contains non-ASCII characters such as U+00F8 (ø).

I am using quanteda, and I imported my text with this code:

 EUCorpus <- corpus(textfile(file="/Users/RiohBurke/Documents/RStudio/PROJECT/*.txt"), encodingFrom = "UTF-8-BOM")

My corpus consists of 166 documents. Having imported the documents into R, what would be the best way to get rid of these non-ASCII characters?

Ricardo asked Jul 04 '16 10:07


1 Answer

Try:

texts(EUCorpus) <- iconv(texts(EUCorpus), from = "UTF-8", to = "ASCII", sub = "")

This converts the encoding to ASCII. Any character that cannot be represented in ASCII (anything outside the 0-127 range) is replaced with the empty string given in sub, which removes it from the text.
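To see the effect without loading a corpus, here is a minimal base-R sketch of the same iconv() call on a plain character vector. The sample strings and the names x and cleaned are made up for illustration; they stand in for what texts(EUCorpus) would return. Note that the handling of sub can differ slightly between platforms' iconv implementations, so it is worth checking the result rather than assuming it.

```r
# Hypothetical sample texts; the first contains U+00F8 (o with stroke)
x <- c("K\u00f8benhavn is the capital", "plain ascii text")

# Same call as in the answer: anything that cannot be converted to ASCII
# is substituted with the empty string, i.e. dropped
cleaned <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

# Sanity check: every remaining code point should be in the 0-127 range
all(utf8ToInt(paste(cleaned, collapse = "")) <= 127)

# Strings that were already pure ASCII should come back unchanged
identical(cleaned[2], x[2])
```

Keep in mind that this deletes the characters rather than transliterating them, so a word containing ø simply loses that letter. If that is not acceptable, the affected documents may need to be fixed at the translation step instead.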

Ken Benoit answered Sep 28 '22 04:09
