
How to remove strange characters using gsub in R? [duplicate]

Tags: r, unicode, utf-8

I'm trying to clean up some text that was loaded into memory using readLines(..., encoding='UTF-8').

If I don't specify the encoding, I see all kinds of strange characters like:

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> 😜ðŸ˜â˜º'"

This is what it looks like after readLines(..., encoding='UTF-8'):

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they  kno I cray cray & just leave it at that
> \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"

You can see the Unicode escape sequences at the end: \u009f, \u0098, etc.

I can't find the right command and regular expression to get rid of these. I've tried:

gsub('[^[:punct:][:alnum:][\\s]]', '', text)

I also tried specifying the unicode characters, but I believe they're getting interpreted as text:

gsub('\u009', '', text) # Unchanged
Nate Reed asked Aug 08 '16 11:08


2 Answers

If you want to use regular expressions, you can keep only those characters you want using a range of ASCII codes:

text = "The way I talk to my family......i would get my ass beat to 
DEATH....but they kno I cray cray & just leave it at that 😜ðŸ˜â˜º'"

gsub('[^\x20-\x7E]', '', text)

# [1] "The way I talk to my family......i would get my ass beat to DEATH....but they kno I cray cray & just leave it at that '"

For reference, asciitable.com has a table of ASCII codes.

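The pattern removes every character outside the range \x20 (space) to \x7E (~), that is, everything except printable ASCII. Note that this also strips tabs and newlines, which is why the line break in the input disappears from the output.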
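As a rough sketch of how the range-based pattern behaves, the snippet below wraps it in two small helpers. The names `strip_non_ascii` and `keep_whitespace`, and the sample string, are made up for illustration and are not part of the original answer. The second helper is one possible variant for cases where tabs and line breaks should survive.

```r
# Hypothetical sample: ASCII text followed by two non-ASCII characters
# (U+263A WHITE SMILING FACE and U+00E9 LATIN SMALL LETTER E WITH ACUTE).
text <- "cray cray & just leave it at that \u263a\u00e9'"

# Remove everything outside printable ASCII (space through tilde).
strip_non_ascii <- function(x) gsub("[^\x20-\x7E]", "", x)

# Variant that additionally keeps tabs and newlines.
keep_whitespace <- function(x) gsub("[^\t\n\x20-\x7E]", "", x)

strip_non_ascii(text)
# [1] "cray cray & just leave it at that '"

strip_non_ascii("line one\nline two")   # the newline is removed
keep_whitespace("line one\nline two")   # the newline is kept
```

If the input contains bytes that are not valid in its declared encoding, `gsub` may warn or fail; in that case converting with `iconv` first is probably the safer route.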

acylam answered Nov 01 '22 14:11


The easiest way to get rid of these characters is to convert from UTF-8 to ASCII, using sub='' to drop anything that cannot be converted:

combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
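A minimal, self-contained sketch of the same call, assuming `combined_doc` is a character vector of UTF-8 text (the two sample strings here are invented stand-ins for the real data):

```r
# Hypothetical stand-in for the question's data.
combined_doc <- c("leave it at that \u263a'", "plain ascii line")

# sub = "" replaces every byte that has no ASCII equivalent with nothing.
cleaned <- iconv(combined_doc, from = "UTF-8", to = "ASCII", sub = "")
cleaned
# [1] "leave it at that '" "plain ascii line"

# For inspection: sub = "byte" marks each non-convertible byte as <xx>
# instead of silently dropping it.
iconv(combined_doc, from = "UTF-8", to = "ASCII", sub = "byte")
```

Running the `sub = "byte"` version first can be a useful check on what is about to be discarded before committing to `sub = ""`.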
Nate Reed answered Nov 01 '22 12:11