I have a few strings in a data set that contain the characters
\x96
\x92
and others.
I can't figure out how to grep for them in R.
I have tried using
pattern="\x96"
pattern="\\x96"
pattern="x96"
but to no avail.
Is there a specific way of dealing with such characters in R?
** UPDATE **
As per the suggestion in the comments, setting perl=TRUE
allows the grep to work.
Can anyone offer a solid explanation of what is going on?
Session info, in case it is relevant:
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=C LC_PAPER=C LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_0.9.3 RMySQL_0.9-3 DBI_0.2-5 stringr_0.6.1 data.table_1.8.6
R supports several different types of regular expressions. The default is POSIX ERE (extended regular expressions), the flavor also used by grep -E and other standard POSIX tools. But according to the documentation, the POSIX ERE engine in R does not currently support escaped hex character codes:
Escaping non-metacharacters with a backslash is implementation-dependent. The current implementation interprets \a as BEL, \e as ESC, \f as FF, \n as LF, \r as CR and \t as TAB. (Note that these will be interpreted by R's parser in literal character strings.)
See Regular Expressions as used in R.
Setting perl=TRUE switches the engine R uses to process regular expressions to PCRE (Perl-compatible regular expressions). PCRE supports escaped hex character codes, and voilà, your regex now works.
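A minimal sketch of the difference, assuming a small made-up vector x rather than your real data. The useBytes=TRUE argument is my own addition, not something from the question: it asks R to match byte by byte, which may avoid complaints about invalid strings in a UTF-8 locale. Behavior can differ between R versions, so treat this as something to try rather than a guaranteed recipe.

```r
# Hypothetical sample data. Inside an R string literal, "\x96" and "\x92"
# are turned into single bytes by R's parser before any regex engine sees them.
x <- c("plain ascii", "a dash \x96 here", "it\x92s quoted")

# The doubled backslash hands the escape sequence \x96 to the regex engine
# itself. With perl = TRUE that engine is PCRE, which understands \xhh.
# useBytes = TRUE is an assumption added here for byte-wise matching.
grep("\\x96", x, perl = TRUE, useBytes = TRUE)

# Several of these bytes can be matched at once with a character class.
grepl("[\\x92\\x96]", x, perl = TRUE, useBytes = TRUE)

# An alternative that avoids regular expressions entirely: match the raw
# byte as a fixed string.
grep("\x96", x, fixed = TRUE, useBytes = TRUE)
```

If the bytes come from a Windows-1252 source (where 0x96 is an en dash and 0x92 is a right single quote), converting the data to UTF-8 before matching may be a cleaner long-term fix than matching raw bytes.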