Why does grep treat \n and \\n the same way?
For example, both match hallo\nworld.
grep("hallo\nworld", pattern="\n")
[1] 1
grep("hallo\nworld", pattern="\\n")
[1] 1
I see that hallo\nworld is parsed into
hallo
world
that is, hallo on one line and world on the next.
So in grep("hallo\nworld", pattern="\n"), is the pattern="\n" a newline or \n literally?
Also note this happens with others; \a, \f, \n, \t, \r and \\a, \\f, \\n, \\t, \\r are all treated identically. But \d, \w, \s can't be used! Why not?
I tested several different strings and found that the explanation lies in how regular expressions work.
There are two levels of escaping: one in an ordinary string, which is simple to understand; the other in a regular expression pattern string. In an R pattern such as grep(x, pattern="some string here"), \\n = \n = a newline character. But in an ordinary string, \\n != \n: the former is literally \n, the latter is a newline character. We can prove this with:
cat("\n")
cat("\\n")
\n>
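The string-level difference can also be checked by counting characters. This is a minimal sketch in base R; the variable names are made up for illustration and do not come from the original answer.

```r
# "\n" is a single character: the newline itself
newline <- "\n"
# "\\n" is two characters: a backslash followed by the letter n
literal <- "\\n"

n_newline <- nchar(newline)
n_literal <- nchar(literal)
print(n_newline)  # 1
print(n_literal)  # 2

# utf8ToInt() shows the underlying code points
print(utf8ToInt(newline))  # 10 (line feed)
print(utf8ToInt(literal))  # 92 110 (backslash, then "n")
```

So the two strings differ before grep ever sees them; any equivalence has to come from the regex engine interpreting the two-character form.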
How to prove this? I'll try with other characters, not just \n, to see if they match in the same way.
special1 <- c( "\a", "\f", "\n", "\t", "\r")
special2 <- c("\\a","\\f","\\n","\\t","\\r")
target <- paste("hallo", special1, "world", sep="")
for (i in seq_along(target)) {
  cat("i=", i, "\n")
  # grepl() returns FALSE on no match; grep(...) == 1 would fail on integer(0)
  if (grepl(special1[i], target[i]))
    print(paste(target[i], "match", special1[i], "succeed"))
  if (grepl(special2[i], target[i]))
    print(paste(target[i], "match", special2[i], "succeed"))
}
output:
i= 1
[1] "hallo\aworld match \a succeed"
[1] "hallo\aworld match `\\a` succeed"
i= 2
[1] "hallo\fworld match \f succeed"
[1] "hallo\fworld match `\\f` succeed"
i= 3
[1] "hallo\nworld match \n succeed"
[1] "hallo\nworld match `\\n` succeed"
i= 4
[1] "hallo\tworld match \t succeed"
[1] "hallo\tworld match `\\t` succeed"
i= 5
[1] "hallo\rworld match \r succeed"
[1] "hallo\rworld match `\\r` succeed"
Note that \a, \f, \n, \t, \r and \\a, \\f, \\n, \\t, \\r were all treated identically in an R regular expression pattern string!
Not only that, you cannot write \d, \w, \s in an R regular expression pattern string!
You can write any of these:
pattern="\a" pattern="\f" pattern="\n" pattern="\t" pattern="\r"
But you can't write any of these in grep:
pattern="\d" pattern="\w" pattern="\s"
I think this is also a bug, as \d, \w, \s are treated differently from \a, \f, \n, \t, \r.
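One likely explanation, offered here as an observation about R's string parser rather than something the answer above states: "\d", "\w" and "\s" are not valid escapes in an R string literal, so the code fails at parse time, before grep is ever called. Doubling the backslash lets the sequence reach the regex engine. The sketch below parses the offending source from text so that the error can be caught; the variable names are illustrative.

```r
# The text held in bad_source is: grepl("\d", "abc123")
bad_source <- 'grepl("\\d", "abc123")'

# parse() rejects "\d" as an unrecognized string escape
parse_ok <- tryCatch({
  parse(text = bad_source)
  TRUE
}, error = function(e) FALSE)
print(parse_ok)  # FALSE

# With the backslash doubled, the regex engine receives \d, \s, \w
has_digit <- grepl("\\d", "abc123")
has_space <- grepl("\\s", "hallo world")
word_only <- grepl("^\\w+$", "hallo")
print(c(has_digit, has_space, word_only))  # TRUE TRUE TRUE
```

By contrast, "\a", "\f", "\n", "\t" and "\r" are valid string escapes, which is why they can be written with a single backslash.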
"\n" matches a newline character.
In Perl, a single-quoted '\n' means a literal backslash followed by the letter n, whereas a double-quoted "\n" means the newline character. Lastly, the special variable $/ is the record separator, which is "\n" by default, which is why you don't need to specify the separator in the above example.
The reason why \n, \\n and \\\n all match is the double evaluation of the search pattern. I observed this by running a couple of examples:
grep("hello\nworld", pattern="\n")
[1] 1
grep("hello\nworld", pattern="\\n")
[1] 1
grep("hello\nworld", pattern="\\\n")
[1] 1
grep("hello\nworld", pattern="\\\\n")
integer(0)
grep("hello\\nworld", pattern="\\\\n")
[1] 1
Keep in mind the rules of evaluating backslash escape sequences:
\\ is replaced with a \
\n is replaced with a NEWLINE character
\ + NEWLINE is replaced with a NEWLINE character
(see ?regex for more details)
With this in mind, if you evaluate the pattern twice, you get:
1. \n => NEWLINE => NEWLINE
2. \\n => \n => NEWLINE
3. \\\n => \ + NEWLINE => NEWLINE
4. \\\\n => \\n => \n
5. \\\\\n => \\ + NEWLINE => \ + NEWLINE
6. \\\\\\n => \\\n => \ + NEWLINE
7. \\\\\\\n => \\\ + NEWLINE => \ + NEWLINE
8. \\\\\\\\n => \\\\n => \\n
And so on. Examples 1-3 all evaluate to a single NEWLINE, which is why these patterns match. (At the same time, the string you're trying to match against the pattern is evaluated only once.)
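The first pass of the evaluation described above can be made visible with nchar(), which reports what is left of each pattern after R has parsed the string literal. This is a small sketch in base R; the names text, patterns, n_chars and matches are illustrative, and the match results simply restate the grep outputs shown earlier.

```r
text <- "hello\nworld"

# The same four patterns as in the grep examples above
patterns <- c("\n", "\\n", "\\\n", "\\\\n")

# Length after the first (string-literal) pass:
#   "\n"     -> NEWLINE                 (1 character)
#   "\\n"    -> backslash, n            (2 characters)
#   "\\\n"   -> backslash, NEWLINE      (2 characters)
#   "\\\\n"  -> backslash, backslash, n (3 characters)
n_chars <- nchar(patterns)
print(n_chars)

# The second pass is done by the regex engine inside grepl().
# grepl() is not vectorised over the pattern, hence vapply().
matches <- vapply(patterns, function(p) grepl(p, text),
                  logical(1), USE.NAMES = FALSE)
print(matches)
```

Only the last pattern still denotes a literal backslash followed by n after both passes, so it is the only one that fails to match a text containing a real newline.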
A discussion on the R mailing list posted by @Aaron explains the double evaluation like this:
There are two levels [of evaluation] because backslashes are escape characters both to R strings and regular expressions.
Note that other languages don't evaluate patterns like this when the pattern is written as a raw string or a regex literal. Take for example Python:
>>> import re
>>> re.search(r'\n', 'hello\nworld') is not None
True
>>> re.search(r'\\n', 'hello\nworld') is not None
False
Or Perl:
$ perl -e 'print "hello\nworld" =~ /\n/ || 0, "\n"'
1
$ perl -e 'print "hello\nworld" =~ /\\n/ || 0, "\n"'
0
And we could go on. So the double evaluation in R seems unusual. Why is it implemented this way? I think the ultimate answer lies with R-devel.
ACKNOWLEDGEMENTS
I thank @Aaron, whose critical comments helped improve this answer.
Note that the backslash itself is special: you have to escape the backslash with a backslash.
The \\n means "I really want to match a newline character, not literal \n".
grep("hallo\nworld", pattern = "\\n")
[1] 1
grep("hallo\\nworld", pattern = "\\\\n")
[1] 1
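If the regex-level interpretation is not wanted at all, one option is the fixed = TRUE argument of grep and grepl, which treats the pattern as a plain string so that only the string-literal escaping applies. The following is a minimal sketch, with made-up variable names, of how the two example texts behave under that setting.

```r
literal_text <- "hallo\\nworld"  # contains a backslash and an n, no newline
newline_text <- "hallo\nworld"   # contains a real newline

# With fixed = TRUE, "\\n" means exactly: backslash followed by n
literal_hit <- grepl("\\n", literal_text, fixed = TRUE)
newline_miss <- grepl("\\n", newline_text, fixed = TRUE)

# A real newline in the pattern still finds a real newline in the text
newline_hit <- grepl("\n", newline_text, fixed = TRUE)

print(c(literal_hit, newline_miss, newline_hit))  # TRUE FALSE TRUE
```

This removes the second level of evaluation, at the cost of losing regular expression features in the pattern.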