I'm trying to clean up some text that was loaded into memory using readLines(..., encoding='UTF-8').
If I don't specify the encoding, I see all kinds of strange characters like:
> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> 😜ðŸ˜â˜º'"
This is what it looks like after readLines(..., encoding='UTF-8'):
> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"
You can see the Unicode escape sequences at the end: \u009f, \u0098, etc.
I can't find the right command and regular expression to get rid of these. I've tried:
gsub('[^[:punct:][:alnum:][\\s]]', '', text)
I also tried specifying the unicode characters, but I believe they're getting interpreted as text:
gsub('\u009', '', text) # Unchanged
If you want to use regular expressions, you can keep only the characters you want by matching against a range of ASCII codes:
text = "The way I talk to my family......i would get my ass beat to
DEATH....but they kno I cray cray & just leave it at that 😜ðŸ˜â˜º'"
gsub('[^\x20-\x7E]', '', text)
# [1] "The way I talk to my family......i would get my ass beat to DEATH....but they kno I cray cray & just leave it at that '"
Consulting a table of ASCII codes (e.g. asciitable.com), you can see that this removes any character not within the range x20 (SPACE) through x7E (~).
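Here is a minimal, self-contained sketch of that approach. The emoji are written as Unicode escapes so the script itself stays ASCII-safe; the exact input string is just an illustrative fragment of the example above:

```r
# Sample text ending in two emoji and a white smiley (written as escapes)
text <- "just leave it at that \U0001F61C\U0001F61D\u263A'"

# Remove every character outside printable ASCII (x20 = SPACE .. x7E = ~)
clean <- gsub('[^\x20-\x7E]', '', text)

# If you also want to preserve tabs and newlines, widen the class:
clean_ws <- gsub('[^\x20-\x7E\t\n]', '', text)
```

Note that this deletes every non-ASCII character, including legitimate accented letters, so it is only appropriate when you truly want ASCII-only output.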
The easiest way to get rid of these characters is to convert the text from UTF-8 to ASCII:
combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
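As a minimal sketch (the input string here is an assumed fragment, not the original variable), sub='' tells iconv to silently drop any character that cannot be represented in the target encoding, rather than failing or inserting a substitute:

```r
# Sample UTF-8 text containing an emoji and a white smiley (as escapes)
text <- "just leave it at that \U0001F61C\u263A"

# Convert to ASCII, deleting characters that have no ASCII representation
ascii_only <- iconv(text, from = 'UTF-8', to = 'ASCII', sub = '')
```

If you would rather see what was removed, sub='byte' replaces each unconvertible byte with its hex code (e.g. <f0>) instead of deleting it.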