Why can 'hallo\nworld' match both \n and \\n in R?

Tags: regex, r, escaping

Why does grep treat \n and \\n the same way?

For example, both match hallo\nworld.

grep("hallo\nworld", pattern="\n")
[1] 1
grep("hallo\nworld", pattern="\\n")
[1] 1

I see that hallo\nworld is parsed into

hallo  
world

that is, hallo on one line and world on the next.

So in grep("hallo\nworld", pattern="\n"), is the pattern="\n" a new line or \n literally?

Also note this happens with others; \a \f \n \t \r and \\a \\f \\n \\t \\r are all treated identically. But \d \w \s can't be used! Why not?

I tested different strings and found that the answer lies in how regular expressions handle escapes.

There are two kinds of escaping. One is escaping in a string, which is simple to understand; the other is escaping in a regular expression pattern. In an R pattern such as grep(x, pattern="some string here"), \\n and \n both end up meaning a newline character. But in an ordinary string \\n != \n: the former is literally \n, the latter is a newline character. We can show this with:

cat("\n")

cat("\\n")
\n> 
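
The difference between the two ordinary strings can also be checked without printing them. This is a minimal sketch using base R only; the variable names nl and bs_n are made up for illustration.

```r
nl   <- "\n"    # one character: a line feed
bs_n <- "\\n"   # two characters: a backslash followed by the letter n

# Number of characters in each string
nchar(nl)
nchar(bs_n)

# Code points of each character (10 = line feed, 92 = backslash, 110 = n)
utf8ToInt(nl)
utf8ToInt(bs_n)
```

So the two strings really are different before they ever reach grep; whatever makes them behave alike must happen inside the pattern matching.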

How to prove this? I'll try with other characters, not just \n, to see if they match in the same way.

special1 <- c( "\a", "\f", "\n", "\t", "\r")
special2 <- c("\\a","\\f","\\n","\\t","\\r")
target <- paste("hallo", special1, "world", sep="")
for (i in seq_along(target)){
    cat("i=", i, "\n")
    # grepl() returns TRUE/FALSE; grep(...) == 1 would make if() fail
    # on integer(0) whenever a pattern does not match
    if (grepl(special1[i], target[i]))
        print(paste(target[i], "match", special1[i], "succeed"))
    if (grepl(special2[i], target[i]))
        print(paste(target[i], "match", special2[i], "succeed"))
}

output:

i= 1
[1] "hallo\aworld match \a succeed"
[1] "hallo\aworld match \\a succeed"
i= 2
[1] "hallo\fworld match \f succeed"
[1] "hallo\fworld match \\f succeed"
i= 3
[1] "hallo\nworld match \n succeed"
[1] "hallo\nworld match \\n succeed"
i= 4
[1] "hallo\tworld match \t succeed"
[1] "hallo\tworld match \\t succeed"
i= 5
[1] "hallo\rworld match \r succeed"
[1] "hallo\rworld match \\r succeed"

Note that \a \f \n \t \r and \\a \\f \\n \\t \\r were all treated identically in an R regular expression pattern!

Not only that, you cannot write \d \w \s in an R regular expression pattern!
You can write any of these:

pattern="\a" pattern="\f" pattern="\n" pattern="\t" pattern="\r"

But you can't write any of these in grep:

pattern="\d" pattern="\w" pattern="\s"

I think this is also a bug, as \d \w \s are treated differently from \a \f \n \t \r.

showkey asked Dec 07 '13 07:12



2 Answers

The reason \n, \\n and \\\n all match is that the search pattern is evaluated twice: once by R when it reads the string literal, and once by the regular expression engine. I observed this by running a couple of examples:

grep("hello\nworld", pattern="\n")
[1] 1
grep("hello\nworld", pattern="\\n")
[1] 1
grep("hello\nworld", pattern="\\\n")
[1] 1
grep("hello\nworld", pattern="\\\\n")
integer(0)
grep("hello\\nworld", pattern="\\\\n")
[1] 1

Keep in mind the rules of evaluating backslash escape sequences:

  • \\ is replaced with a \
  • \n is replaced with a NEWLINE character
  • \ + NEWLINE is replaced with a NEWLINE character
  • (see ?Quotes for escapes in string literals and ?regex for escapes in patterns)

With this in mind, if you evaluate the pattern twice, you get:

  1. \n => NEWLINE => NEWLINE
  2. \\n => \n => NEWLINE
  3. \\\n => \ + NEWLINE => NEWLINE
  4. \\\\n => \\n => \n
  5. \\\\\n => \\ + NEWLINE => \ + NEWLINE
  6. \\\\\\n => \\\n => \ + NEWLINE
  7. \\\\\\\n => \\\ + NEWLINE => \ + NEWLINE
  8. \\\\\\\\n => \\\\n => \\n

And so on. Examples 1-3 all evaluate to a single NEWLINE, that's why these patterns will match. (At the same time, the string you're trying to match against the pattern is evaluated only once.)
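
The two levels in the list above can be looked at separately. The following is a minimal sketch in base R; the names patterns, chars_seen and matches are made up for illustration. nchar() shows what is left after R has read the string literal, and grepl() shows what the regex engine then does with it.

```r
# Patterns 1-4 from the list above, written as R string literals
patterns <- c("\n", "\\n", "\\\n", "\\\\n")

# Level 1 (string literal): number of characters that reach the regex engine
chars_seen <- nchar(patterns)

# Level 2 (regex engine): does each pattern match a string with a real newline?
matches <- vapply(patterns, function(p) grepl(p, "hello\nworld"),
                  logical(1), USE.NAMES = FALSE)

print(data.frame(chars_seen, matches))
```

The first three patterns reach the regex engine as different character sequences but still match, which agrees with the grep results shown earlier; only the fourth fails against a real newline.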

A discussion on the R mailing list posted by @Aaron explains the double evaluation like this:

There are two levels [of evaluation] because backslashes are escape characters both to R strings and regular expressions.

Note that other languages don't have to evaluate patterns like this. Take for example Python with raw strings:

>>> import re
>>> re.search(r'\n', 'hello\nworld') is not None
True
>>> re.search(r'\\n', 'hello\nworld') is not None
False

Or Perl:

$ perl -e 'print "hello\nworld" =~ /\n/ || 0, "\n"'
1
$ perl -e 'print "hello\nworld" =~ /\\n/ || 0, "\n"'
0

And we could go on. So the double evaluation in R seems unusual. Why is it implemented this way? I think the ultimate answer lies with R-devel.
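
For a like-for-like comparison: the Python example above uses raw strings (the r'...' prefix), which switch off the string-level evaluation. R has had raw string literals since version 4.0.0, so the sketch below assumes R 4.0.0 or later; the variable names are made up for illustration.

```r
# A raw string keeps the backslash as typed, so r"(\n)" is the same as "\\n"
same_string <- identical(r"(\n)", "\\n")

# Only the regex engine interprets the escapes now
raw_newline  <- grepl(r"(\n)",  "hello\nworld")   # regex \n against a real newline
raw_literal  <- grepl(r"(\\n)", "hello\nworld")   # regex \\n against a real newline
raw_literal2 <- grepl(r"(\\n)", "hello\\nworld")  # regex \\n against a literal \n

print(c(same_string, raw_newline, raw_literal, raw_literal2))
```

With raw strings the first two grepl calls correspond to the two Python calls, so the second level of evaluation is the only one left.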

ACKNOWLEDGEMENTS

I thank @Aaron, whose critical comments helped improve this answer.

janos answered Oct 15 '22 02:10


Note that the backslash itself is special: you have to escape the backslash with a backslash.

The \\n means "I really want to match a newline character, not literal \n"

grep("hallo\nworld", pattern = "\\n")
[1] 1

grep("hallo\\nworld", pattern = "\\\\n")
[1] 1
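
The same rule is a likely explanation for why "\d", "\w" and "\s" cannot be written at all: they are not escape sequences that R accepts in a string literal (see ?Quotes), so the code fails to parse before grep is ever called. A minimal sketch, where parses() is a hypothetical helper written for this example and not part of R:

```r
# Hypothetical helper: TRUE if the given source text is valid R code
parses <- function(src) {
  tryCatch({
    parse(text = src)
    TRUE
  }, error = function(e) FALSE)
}

ok_newline <- parses('"\\n"')    # source text is "\n": a known string escape
ok_digit   <- parses('"\\d"')    # source text is "\d": rejected by the parser
ok_escaped <- parses('"\\\\d"')  # source text is "\\d": backslash escaped first

# With the backslash escaped, the regex engine receives \d and matches a digit
digit_found <- grepl("\\d", "hallo123")

print(c(ok_newline, ok_digit, ok_escaped, digit_found))
```

So "\\d", "\\w" and "\\s" are the forms to use in a pattern; the single-backslash forms are rejected at the string level, not by grep.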

hwnd answered Oct 15 '22 00:10