
R removing unicode linebreaks

Tags:

regex

r

I have a string containing Unicode newline characters that I need to remove.

These characters can be carriage return (\U000D), newline (\U000A), line separator (\U2028), or paragraph separator (\U2029).

I am able to remove the carriage return and newline characters by using the following.

gsub("\\s", "", x)

This works fine for those two characters, but it does not remove the line separator \U2028 or the paragraph separator \U2029.

Is there another way to do this?

user3856888 asked Aug 23 '14 23:08



1 Answer

You can switch on PCRE with perl=TRUE and use the escape sequence \R, which matches any Unicode newline sequence, including the line separator and paragraph separator:

> x <- 'foo\U000D\U000A bar\U2029 baz\U2028\U2029'
> x
## [1] "foo\r\n bar\u2029 baz\u2028\u2029"
> gsub('\\R', '', x, perl=TRUE)
## [1] "foo bar baz"
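
If you would rather not depend on \R, which in PCRE also matches vertical tab, form feed, and next line (U+0085), a rough alternative is an explicit character class listing only the four characters from the question. This is a minimal sketch, not part of the original answer: the helper name strip_linebreaks is made up for illustration, and it assumes the string is handled as UTF-8.

```r
# Hypothetical helper (name is illustrative only): removes CR, LF,
# LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029), nothing else.
strip_linebreaks <- function(s) {
  # The \u escapes are expanded by R's parser, so the regex engine
  # receives the literal characters inside the bracket expression.
  gsub("[\r\n\u2028\u2029]", "", s)
}

x <- "foo\r\n bar\u2029 baz\u2028\u2029"
strip_linebreaks(x)
## [1] "foo bar baz"
```

Unlike gsub("\\s", "", x), this leaves ordinary spaces in place, which is why the words stay separated in the output.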
hwnd answered Oct 08 '22 06:10
