
How do I remove all the Chinese characters from a string?

Tags: string, r

I am trying to remove all the Chinese characters from the following string:

x <- "2.87Y 1282501 12电网MTN4 AAA 4.40 /4.30* 2000、" 

How can I do this?

Huimin Peng asked Dec 19 '22 03:12


2 Answers

I looked up the Unicode character ranges for CJK (Chinese, Japanese, Korean) text. If all your strings are similar to this particular one, removing the characters in the following two ranges should be enough:

  • 4E00-9FFF for CJK Unified Ideographs
  • 3000-303F for CJK Symbols and Punctuation

Using gsub(), we can do

gsub("[\U4E00-\U9FFF\U3000-\U303F]", "", x)
# [1] "2.87Y 1282501 12MTN4 AAA 4.40 /4.30* 2000"

Data:

x <- "2.87Y 1282501 12电网MTN4 AAA 4.40 /4.30* 2000、" 
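
As a rough sketch, the same substitution can be wrapped in a small function so it is easy to reuse. The name remove_cjk is made up for illustration and is not part of the answer. The string is written with \u escapes only so the example does not depend on the file's encoding, and a UTF-8 capable R session is assumed.

```r
# Hypothetical helper (the name is an assumption): drop characters in the
# two Unicode ranges listed above.
remove_cjk <- function(s) {
  # \u4E00-\u9FFF: CJK Unified Ideographs
  # \u3000-\u303F: CJK Symbols and Punctuation
  gsub("[\u4E00-\u9FFF\u3000-\u303F]", "", s)
}

# Same string as in the question, with the three CJK characters escaped.
x <- "2.87Y 1282501 12\u7535\u7f51MTN4 AAA 4.40 /4.30* 2000\u3001"
cleaned <- remove_cjk(x)
print(cleaned)
```

Since gsub() is vectorized over its input, remove_cjk() should also work on a character vector. Characters outside these two ranges (for example kana, hangul, or full-width forms) are not touched and would need their own ranges added to the character class.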
Rich Scriven answered Jan 11 '23 22:01


You can also do this using iconv(). Note that this removes all non-ASCII characters, not only Chinese, Japanese, and Korean ones, so accented letters and other non-ASCII symbols are dropped as well.

iconv(x, "latin1", "ASCII", sub="")
#[1] "2.87Y 1282501 12MTN4 AAA 4.40 /4.30* 2000"
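
To illustrate the difference between the two approaches, here is a small comparison on a made-up string (not from the question) that mixes an accented Latin letter with Chinese characters. The range-based gsub() call should leave the accented letter alone, whereas the ASCII conversion is expected to lose it along with the Chinese characters. A UTF-8 capable session is assumed, and how iconv treats unconvertible bytes can vary somewhat between platforms.

```r
# Made-up test string: "cafe" with an acute accent on the e, a space,
# then two Chinese characters.
s <- "caf\u00e9 \u7535\u7f51"

# Range-based removal: only the two CJK ranges are targeted.
cjk_removed <- gsub("[\u4E00-\u9FFF\u3000-\u303F]", "", s)

# ASCII conversion, same call as above: every non-ASCII byte is affected.
ascii_only <- iconv(s, "latin1", "ASCII", sub = "")

print(cjk_removed)
print(ascii_only)
```

If the data may contain non-ASCII characters worth keeping, the range-based pattern is likely the safer choice. If only plain ASCII should remain, iconv() is the shorter route.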
Santosh M. answered Jan 11 '23 22:01