I'm trying to clean up some text that was loaded into memory using readLines(..., encoding='UTF-8').
If I don't specify the encoding, I see all kinds of strange characters like:
> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> 😜ðŸ˜â˜º'"
This is what it looks like after readLines(..., encoding='UTF-8'):
> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"
You can see the Unicode escape sequences at the end: \u009f, \u0098, etc.
I can't find the right command and regular expression to get rid of these. I've tried:
gsub('[^[:punct:][:alnum:][\\s]]', '', text)
I also tried specifying the unicode characters, but I believe they're getting interpreted as text:
gsub('\u009', '', text) # Unchanged
If you want to use regular expressions, you can keep only the characters you want by matching against a range of ASCII codes:
text = "The way I talk to my family......i would get my ass beat to
DEATH....but they kno I cray cray & just leave it at that 😜ðŸ˜â˜º'"
gsub('[^\x20-\x7E]', '', text)
# [1] "The way I talk to my family......i would get my ass beat to DEATH....but they kno I cray cray & just leave it at that '"
Consulting a table of ASCII codes (e.g. asciitable.com), you can see that this removes any character not within the range x20 (SPACE) through x7E (~).
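Here is a minimal, self-contained sketch of that approach. The emoji are written as Unicode escapes so the script itself stays ASCII-safe; the exact input string is just an illustrative fragment of the example above:

```r
# Sample text ending in two emoji and a white smiley (written as escapes)
text <- "just leave it at that \U0001F61C\U0001F61D\u263A'"

# Remove every character outside printable ASCII (x20 = SPACE .. x7E = ~)
clean <- gsub('[^\x20-\x7E]', '', text)

# If you also want to preserve tabs and newlines, widen the class:
clean_ws <- gsub('[^\x20-\x7E\t\n]', '', text)
```

Note that this deletes every non-ASCII character, including legitimate accented letters, so it is only appropriate when you truly want ASCII-only output.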
The easiest way to get rid of these characters is to convert the text from UTF-8 to ASCII:
combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
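As a minimal sketch (the input string here is an assumed fragment, not the original variable), sub='' tells iconv to silently drop any character that cannot be represented in the target encoding, rather than failing or inserting a substitute:

```r
# Sample UTF-8 text containing an emoji and a white smiley (as escapes)
text <- "just leave it at that \U0001F61C\u263A"

# Convert to ASCII, deleting characters that have no ASCII representation
ascii_only <- iconv(text, from = 'UTF-8', to = 'ASCII', sub = '')
```

If you would rather see what was removed, sub='byte' replaces each unconvertible byte with its hex code (e.g. <f0>) instead of deleting it.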