Why can 'hallo\nworld' match both \n and \\n in R?

Tags: regex, r, escaping

Why does grep treat \n and \\n the same way?

For example, both match hallo\nworld.

grep("hallo\nworld", pattern="\n")
[1] 1
grep("hallo\nworld", pattern="\\n")
[1] 1

I see that hallo\nworld is parsed into

hallo  
world

that is, hallo on one line and world on the next.

So in grep("hallo\nworld", pattern="\n"), is pattern="\n" a newline character or a literal \n?

The same thing happens with other escapes: \a \f \n \t \r and \\a \\f \\n \\t \\r are all treated identically. But \d \w \s can't be used at all! Why not?

I tested several different strings, and I think the explanation lies in how regular expressions handle escapes.

There are two kinds of escaping. One is escaping in an ordinary string, which is simple to understand; the other is escaping in a regular expression pattern. In an R pattern such as grep(x, pattern="some string here"), \\n and \n both stand for a newline character. In an ordinary string, however, \\n and \n differ: the former is a literal backslash followed by n, and the latter is a newline character. We can show this with:

cat("\n")

cat("\\n")
\n>

To check whether this holds in general, I tried other characters, not just \n, to see if they match in the same way.

special1 <- c("\a", "\f", "\n", "\t", "\r")       # the control characters themselves
special2 <- c("\\a", "\\f", "\\n", "\\t", "\\r")  # a backslash followed by a letter
target <- paste("hallo", special1, "world", sep="")
for (i in 1:5) {
    cat("i=", i, "\n")
    # grepl() returns TRUE or FALSE, so if() cannot fail on an empty result
    if (grepl(special1[i], target[i]))
        print(paste(target[i], "match", special1[i], "succeed"))
    if (grepl(special2[i], target[i]))
        print(paste(target[i], "match", special2[i], "succeed"))
}

output:

i= 1
[1] "hallo\aworld match \a succeed"
[1] "hallo\aworld match \\a succeed"
i= 2
[1] "hallo\fworld match \f succeed"
[1] "hallo\fworld match \\f succeed"
i= 3
[1] "hallo\nworld match \n succeed"
[1] "hallo\nworld match \\n succeed"
i= 4
[1] "hallo\tworld match \t succeed"
[1] "hallo\tworld match \\t succeed"
i= 5
[1] "hallo\rworld match \r succeed"
[1] "hallo\rworld match \\r succeed"

Note that \a \f \n \t \r and \\a \\f \\n \\t \\r are all treated identically in an R regular expression pattern!

Not only that, you cannot write \d \w \s in an R regular expression pattern!
You can write any of these:

pattern="\a"  pattern="\f"  pattern="\n"  pattern="\t"  pattern="\r"

But you can't write any of these in grep:

pattern="\d"  pattern="\w"  pattern="\s"

I think this is also a bug, since \d \w \s are treated differently from \a \f \n \t \r.

asked Dec 07 '13 07:12

showkey

2 Answers

\n, \\n and \\\n all match because the search pattern is evaluated twice: once by R's string parser and once by the regular expression engine. I observed this by running a few examples:

grep("hello\nworld", pattern="\n")
[1] 1
grep("hello\nworld", pattern="\\n")
[1] 1
grep("hello\nworld", pattern="\\\n")
[1] 1
grep("hello\nworld", pattern="\\\\n")
integer(0)
grep("hello\\nworld", pattern="\\\\n")
[1] 1

Keep in mind the rules for evaluating backslash escape sequences:

- \\ is replaced with a \
- \n is replaced with a NEWLINE character
- \ + NEWLINE is replaced with a NEWLINE character

(See ?Quotes for escapes in R string literals and ?regex for escapes in regular expressions.)

With this in mind, if you evaluate the pattern twice, you get:

1. \n => NEWLINE => NEWLINE
2. \\n => \n => NEWLINE
3. \\\n => \ + NEWLINE => NEWLINE
4. \\\\n => \\n => \n
5. \\\\\n => \\ + NEWLINE => \ + NEWLINE
6. \\\\\\n => \\\n => \ + NEWLINE
7. \\\\\\\n => \\\ + NEWLINE => \ + NEWLINE
8. \\\\\\\\n => \\\\n => \\n

And so on. Examples 1-3 all evaluate to a single NEWLINE, which is why these patterns match. (The string you're matching against, by contrast, is evaluated only once, by R's string parser.)
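The two levels can be inspected separately. The following is a minimal sketch, not a definitive test: the variable names are my own, and it assumes R's default regex engine (no perl = TRUE or fixed = TRUE). nchar() shows how many characters survive R's string parser, and grepl() shows what the regex engine then does with them.

```r
# Four ways of writing the pattern in R source code
patterns <- c("\n", "\\n", "\\\n", "\\\\n")

# Level 1 (R's parser): number of characters that reach the regex engine
pattern_lengths <- nchar(patterns)
print(pattern_lengths)

# Level 2 (regex engine): does each pattern match a string containing a newline?
target <- "hello\nworld"
matches_newline <- vapply(patterns, function(p) grepl(p, target),
                          logical(1), USE.NAMES = FALSE)
print(matches_newline)

# The same patterns against a literal backslash followed by n
literal_target <- "hello\\nworld"
matches_literal <- vapply(patterns, function(p) grepl(p, literal_target),
                          logical(1), USE.NAMES = FALSE)
print(matches_literal)
```

If the reasoning above holds, the first three patterns should match only the newline target, and the fourth only the literal one.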

A discussion on the R mailing list posted by @Aaron explains the double evaluation like this:

"There are two levels [of evaluation] because backslashes are escape characters both to R strings and regular expressions."

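Note that this is not specific to R: any language in which the pattern is written as an ordinary string literal has the same two levels. Some languages offer a syntax that skips the first one. Take for example Python's raw strings (the r prefix), where the pattern reaches the regex engine exactly as typed:

>>> import re
>>> re.search(r'\n', 'hello\nworld') is not None
True
>>> re.search(r'\\n', 'hello\nworld') is not None
False

Or Perl's regex literals:

$ perl -e 'print "hello\nworld" =~ /\n/ || 0, "\n"'
1
$ perl -e 'print "hello\nworld" =~ /\\n/ || 0, "\n"'
0

And we could go on. With a non-raw Python string such as '\\n', the second search would succeed just as it does in R. So what seems unusual in R is not the double evaluation itself but that patterns are normally written as ordinary strings. Why is it implemented this way? I think the ultimate answer lies with R-devel.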
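R itself offers ways to switch off one of the two levels. The sketch below uses my own variable names and rests on two assumptions: that fixed = TRUE is acceptable for the match (it disables regex interpretation entirely), and that the R version is 4.0.0 or later, where raw string literals such as r"(...)" are available.

```r
# fixed = TRUE turns off the regex level: the pattern is matched literally
fixed_hits_literal <- grepl("\\n", "hello\\nworld", fixed = TRUE)  # backslash + n vs. backslash + n
fixed_hits_newline <- grepl("\\n", "hello\nworld", fixed = TRUE)   # backslash + n vs. newline

# Raw strings (assumes R >= 4.0.0) turn off the string level instead
raw_hits_newline <- grepl(r"(\n)", "hello\nworld")    # regex \n matches the newline
raw_hits_literal <- grepl(r"(\\n)", "hello\\nworld")  # regex \\ matches a literal backslash

print(c(fixed_hits_literal, fixed_hits_newline, raw_hits_newline, raw_hits_literal))
```

With a raw string, the pattern is typed the same way as in the Python raw-string examples.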

ACKNOWLEDGEMENTS

I thank @Aaron, whose critical comments helped improve this answer.

answered Oct 15 '22 02:10

janos

Note that the backslash itself is special: you have to escape the backslash with a backslash.

The pattern "\\n" means "I really want to match a newline character, not a literal \n". To match a literal backslash followed by n, double the backslashes again:

grep("hallo\nworld", pattern = "\\n")
[1] 1

grep("hallo\\nworld", pattern = "\\\\n")
[1] 1
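The same rule seems to explain the \d \w \s case from the question. As a sketch under that assumption (variable names and the sample string "room 101" are mine): "\d" is rejected by R's string parser before grep ever sees it, because \d is not a recognized string escape, whereas "\\d" reaches the regex engine as \d.

```r
# "\d" is not a valid escape in an R string literal, so the parser rejects it
parse_result <- tryCatch(
  { parse(text = '"\\d"'); "parsed" },
  error = function(e) "parse error"
)
print(parse_result)

# Doubling the backslash lets the regex engine see \d, \w and \s
has_digit <- grepl("\\d", "room 101")
has_word  <- grepl("\\w", "room 101")
has_space <- grepl("\\s", "room 101")
print(c(has_digit, has_word, has_space))
```

So \a \f \n \t \r work with a single backslash only because they happen to be valid string escapes as well as regex escapes.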

answered Oct 15 '22 00:10

hwnd
