I thought that the dot `.` in regex will match any character, except the end-of-line character. However, in R, I found that the dot can match anything, including the newline characters `\n`, `\r` or `\r\n`:
grep(c("\r","\n","\r\n"),pattern=".")
[1] 1 2 3
Can someone explain the contradiction?
Let's take the DOT metacharacter as you've seen it thus far. The DOT has a special meaning when used inside a regular expression: it matches any character except the newline. In an ordinary string, however, the dot simply ends a sentence.
In JavaScript, the `s` (dotAll) flag, which makes the dot match line breaks, was formally added in the ECMAScript 2018 specification. In PowerGREP, tick the checkbox labeled “dot matches line breaks” to make the dot match all characters. In EditPad Pro, turn on the “Dot” or “Dot matches newline” search option. In Perl, the mode where the dot also matches line breaks is called “single-line mode”.
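As a rough illustration of that "single-line mode" from R: the PCRE engine selected with `perl = TRUE` accepts the inline modifier `(?s)`, which is a standard PCRE feature. This is only a sketch; the test string and variable names are made up for the example.

```r
# A string with a line feed between "a" and "b" (illustrative input)
x <- "a\nb"

# PCRE default mode: the dot does not match "\n", so "a.b" fails
pcre_default <- grepl("a.b", x, perl = TRUE)

# PCRE with the inline (?s) modifier ("single-line mode"): the dot matches "\n"
pcre_dotall <- grepl("(?s)a.b", x, perl = TRUE)

print(c(default = pcre_default, dotall = pcre_dotall))
```

The modifier only changes what the dot matches; it does not affect the anchors `^` and `$`.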
The dot is a very powerful regex metacharacter. It allows you to be lazy. Put in a dot, and everything matches just fine when you test the regex on valid data. The problem is that the regex also matches in cases where it should not match. If you are new to regular expressions, some of these cases may not be so obvious at first.
The Dot Matches (Almost) Any Character

In regular expressions, the dot or period is one of the most commonly used metacharacters. Unfortunately, it is also the most commonly misused metacharacter. The dot matches a single character, without caring what that character is. The only exceptions are line break characters.
The page at http://www.regular-expressions.info/dot.html explains that the rule that the dot does not match the end-of-line character exists mostly for historical reasons:
The first tools that used regular expressions were line-based. They would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the string could never contain line breaks, so the dot could never match them.
However,
Modern tools and languages can apply regular expressions to very large strings or even entire files. Except for JavaScript and VBScript, all regex flavors discussed here have an option to make the dot match all characters, including line breaks.
Apparently, R is one such language where, by default, the dot will match every character. (I point you to Joshua's comment above, recommending you look at `?regex` and the POSIX 1003.2 standard.)
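If you want to stay with R's default engine but still exclude line feeds, one possible workaround is a negated bracket expression instead of the dot. A minimal sketch, reusing the vector from the question plus an empty string; the variable names are my own:

```r
x <- c("\r", "\n", "\r\n", "")

# Default engine: the dot matches any single character, "\n" included
dot_hits <- grep(".", x)

# Negated bracket expression: any single character except "\n"
non_lf_hits <- grep("[^\n]", x)

print(dot_hits)     # 1 2 3
print(non_lf_hits)  # 1 3
```

Note that `[^\n]` still matches `\r`; use `[^\r\n]` if carriage returns should be excluded as well.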
The page I linked above also mentions Perl and notes that, in its default mode, the dot will not match line breaks.
Notice how R's `grep` function has a `perl` option. If you turn it on, you do get a different output:
> grep(".", c("\r","\n","\r\n"), perl = TRUE)
[1] 1 3
This is telling me that `\n` is the line break character, but `\r` is not, something that comparing `cat("\r")` and `cat("\n")` can confirm.
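Another way to check, without relying on how the console renders control characters, is to look at the code points and test each character on its own. This is just a sketch; the result for `\n` follows from PCRE's default dot behaviour, while the result for `\r` may depend on how PCRE was built on your system.

```r
# Code points: carriage return is 13, line feed is 10
cr <- utf8ToInt("\r")
lf <- utf8ToInt("\n")
print(c(cr = cr, lf = lf))

# With perl = TRUE the dot does not match a lone line feed
lf_matches_dot <- grepl(".", "\n", perl = TRUE)

# For a lone carriage return the output above suggests a match,
# but treat this as build-dependent
cr_matches_dot <- grepl(".", "\r", perl = TRUE)

print(c(lf = lf_matches_dot, cr = cr_matches_dot))
```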
(I'm on macOS, if it makes any difference.)