What is the best way to remove non-ASCII characters from a text corpus when using Quanteda in R? [duplicate]

Question

I am in dire need. I have translated my corpus into a common language (English), but some of the words were not converted properly. As a result, the corpus still contains non-ASCII characters such as U+00F8 (ø).

I am using Quanteda and I have imported my text using this code:

 EUCorpus <- corpus(textfile(file="/Users/RiohBurke/Documents/RStudio/PROJECT/*.txt"), encodingFrom = "UTF-8-BOM")

My corpus consists of 166 documents. Having imported the documents into R, what would be the best way to get rid of these non-ASCII characters?

Ken Benoit · Accepted Answer

Try:

texts(EUCorpus) <- iconv(texts(EUCorpus), from = "UTF-8", to = "ASCII", sub = "")

This converts the encoding to ASCII and replaces any character that cannot be represented in ASCII (anything outside the 0-127 range) with an empty string, which removes it from the text.
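To see what `sub = ""` does before applying it to a whole corpus, here is a minimal base R sketch on a plain character vector. The sample strings and the names `x` and `ascii_only` are made up for illustration, and the exact treatment of unconvertible bytes can depend on the platform's iconv implementation, so treat this as a quick check rather than a guarantee.

```r
# Hypothetical sample texts (not from the original corpus).
# The \u escapes keep this snippet itself ASCII-only.
x <- c("sm\u00f8rrebr\u00f8d", "plain ascii", "caf\u00e9 au lait")

# Same call as in the answer, applied to an ordinary character vector:
# bytes with no ASCII equivalent are replaced by the empty string.
ascii_only <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

print(ascii_only)

# Sanity check: every remaining code point is in the 0-127 range.
stopifnot(all(utf8ToInt(paste(ascii_only, collapse = "")) <= 127))
```

Note that the characters are dropped rather than transliterated, so a word containing ø loses that letter instead of getting an "o". If that matters for your analysis, inspect a few affected documents before and after the conversion.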
