How to strsplit using '|' character, it behaves unexpectedly?

I would like to split a character string at the pattern "|"

but

unlist(strsplit("I am | very smart", " | "))

[1] "I"     "am"    "|"     "very"  "smart"

or

gsub(pattern="|", replacement="*", x="I am | very smart")    

[1] "*I* *a*m* *|* *v*e*r*y* *s*m*a*r*t*"
RockScience asked Jun 17 '11 07:06


3 Answers

The problem is that by default strsplit interprets " | " as a regular expression, in which | has special meaning (as "or").

Use the fixed argument:

unlist(strsplit("I am | very smart", " | ", fixed=TRUE))
# [1] "I am"       "very smart"

A side effect is faster computation, since no regular expression has to be compiled.

stringr alternative:

unlist(stringr::str_split("I am | very smart", stringr::fixed(" | ")))
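
To see why the original call split on every space, it may help to compare the regex reading of " | " with the literal one. The following is a minimal sketch in base R; the variable names (x, regex_split, space_split, fixed_split) are only illustrative.

```r
x <- "I am | very smart"

# As a regex, " | " reads as "a space OR a space",
# so it behaves like splitting on a single space.
regex_split <- strsplit(x, " | ")[[1]]
space_split <- strsplit(x, " ")[[1]]
identical(regex_split, space_split)  # TRUE

# With fixed = TRUE the pattern is matched literally,
# so only the three-character sequence " | " is a separator.
fixed_split <- strsplit(x, " | ", fixed = TRUE)[[1]]
fixed_split  # "I am" "very smart"
```

Note that strsplit always returns a list with one element per input string, which is why the examples index with [[1]] or call unlist.
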
Marek answered Oct 20 '22 00:10


| is a metacharacter. You need to escape it (using \\ before it).

> unlist(strsplit("I am | very smart", " \\| "))
[1] "I am"       "very smart"
> sub(pattern="\\|", replacement="*", x="I am | very smart")
[1] "I am * very smart"

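Edit: The reason you need two backslashes is that, inside an R string, a single backslash starts an escape sequence for special symbols such as \n (newline) and \t (tab). Writing "\\|" therefore produces the two-character string \|, which the regex engine reads as a literal |. For more information, see the help page ?regex. The full list of metacharacters is . \ | ( ) [ { ^ $ * + ?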
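
The double-backslash point can be checked directly. This is a small sketch in base R; the names pat, class_split and all_replaced are illustrative only. It also shows a character class, [|], as one way to match the pipe without any backslashes.

```r
pat <- "\\|"

# The R string literal "\\|" holds two characters: a backslash and a pipe.
nchar(pat)       # 2
cat(pat, "\n")   # prints \|

# Inside a character class the pipe is literal, so no escaping is needed.
class_split <- strsplit("I am | very smart", " [|] ")[[1]]
class_split      # "I am" "very smart"

# sub() replaces only the first match; gsub() replaces every match.
all_replaced <- gsub("\\|", "*", "a|b|c")
all_replaced     # "a*b*c"
```
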

nullglob answered Oct 20 '22 00:10


If you are parsing a table, then calling read.table might be a better option. A tiny example:

> txt <- textConnection("I am | very smart")
> read.table(txt, sep='|')
     V1          V2
1 I am   very smart

So I would suggest fetching the wiki page with RCurl, grabbing the interesting part of the page with XML (which also has a really neat function for parsing HTML tables), and, if an HTML format is not available, calling read.table with the appropriate sep. Good luck!
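
In the read.table example above, the fields keep the spaces around the separator. A possible refinement, sketched here under the assumption that the surrounding whitespace is not wanted, is to pass strip.white = TRUE; the text argument is used in place of a textConnection, and the name df is illustrative.

```r
# Parse a pipe-delimited line and trim whitespace around each field.
df <- read.table(
  text = "I am | very smart",
  sep = "|",
  strip.white = TRUE,        # drop leading/trailing spaces in each field
  stringsAsFactors = FALSE   # keep the columns as character vectors
)

df$V1  # "I am"
df$V2  # "very smart"
```
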

daroczig answered Oct 19 '22 22:10