How to remove unicode from string?

Question

I have a string like:

q <- "<U+00A6>  1000-66329"

I want to remove <U+00A6> and get only 1000 66329.

I tried using:

gsub("\u00a6"," ", q,perl=T)

But it does not remove anything. How should I call gsub to get only 1000 66329?

Wiktor Stribiżew · Accepted Answer

I just want to remove unicode <U+00A6> which is at the beginning of string.

Then you do not need gsub; you can use sub with the "^\\s*<U\\+\\w+>\\s*" pattern (backslashes must be doubled inside an R string literal):

q <- "<U+00A6>  1000-66329"
sub("^\\s*<U\\+\\w+>\\s*", "", q)

Pattern details:

^ - start of string
\s* - zero or more whitespaces
<U\+ - a literal char sequence <U+
\w+ - 1 or more letters, digits or underscores
> - a literal >
\s* - zero or more whitespaces.
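A minimal sketch of the pattern breakdown above, assuming the string really contains the eight literal characters <U+00A6> rather than the single character U+00A6 (which would explain why the attempt in the question matched nothing). The variable names has_real_char and cleaned are illustrative only:

```r
q <- "<U+00A6>  1000-66329"

# Look for the actual U+00A6 character. The string only holds the
# literal text "<U+00A6>", so this search finds nothing.
has_real_char <- grepl("\u00a6", q, fixed = TRUE)

# Match the literal marker at the start of the string, together with
# any whitespace around it, and replace it with nothing.
cleaned <- sub("^\\s*<U\\+\\w+>\\s*", "", q)

print(has_real_char)
print(cleaned)
```

Because the trailing \s* is part of the match, the two spaces after the marker are removed as well.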

If you also need to replace the - with a space, add a |- alternative and use gsub (we now expect several replacements, and the replacement must be a space, the same as in akrun's answer):

trimws(gsub("^\\s*<U\\+\\w+>|-", " ", q))

See the R online demo
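To make the two steps of that one-liner visible, here is a small sketch that stores the intermediate value. It assumes the same input string as the question; spaced and result are names chosen for illustration:

```r
q <- "<U+00A6>  1000-66329"

# Step 1: replace the leading marker and every hyphen with one space.
# The marker becomes a space, so the string now starts with whitespace.
spaced <- gsub("^\\s*<U\\+\\w+>|-", " ", q)

# Step 2: strip the leading (and any trailing) whitespace.
result <- trimws(spaced)

print(spaced)
print(result)
```

Using a space as the replacement for both alternatives is what makes trimws necessary: the hyphen needs a space, and the marker then leaves one behind at the start.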

Rentrop · Answer

If it is always the first character, you can try:

substring("\U00A6  1000-66329", 2)

If R prints the string as <U+00A6>  1000-66329 instead of ¦  1000-66329, then <U+00A6> is stored as the literal text "<U+00A6>" rather than as the Unicode character. In that case you can do:

substring("<U+00A6>  1000-66329",9)

Both ways the result is:

[1] "  1000-66329"
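As a hedged variation on the same idea, the start position can be derived from the marker length with nchar instead of hard-coding 9. This sketch assumes the marker is known in advance and always sits at the start; marker, literal, and real are illustrative names:

```r
marker  <- "<U+00A6>"
literal <- "<U+00A6>  1000-66329"   # marker stored as plain text
real    <- "\u00a6  1000-66329"     # marker stored as one character

# Literal case: skip as many characters as the marker is long.
from_literal <- substring(literal, nchar(marker) + 1)

# Real-character case: the marker is a single character, so start at 2.
from_real <- substring(real, 2)

print(from_literal)
print(from_real)
```

Both results keep the two leading spaces; wrapping either call in trimws would remove them.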

Tags:

regex

r

gsub

user6559913
