I have developed a regular expression to identify a block of XML inside a text file. The expression looks like this (I have removed all Java escape backslashes to make it easier to read):
<\?xml\s+version="[\d\.]+"\s*\?>\s*<\s*rdf:RDF[^>]*>[\s\S]*?<\s*\/\s*rdf:RDF\s*>
Then I optimised it and replaced [\s\S]*? with .*? and it suddenly stopped recognising the XML.
As far as I know, \s means all whitespace characters and \S means all non-whitespace characters (that is, [^\s]), so [\s\S] should logically be equivalent to . (dot).
I didn't use greedy quantifiers, so what could be the difference?
The regular expressions . and [\s\S] are not equivalent, since . doesn't match line terminators (such as a newline) by default.
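A minimal sketch of the difference, assuming a made-up two-line input (the class name, method names and sample string are hypothetical, chosen only for illustration):

```java
import java.util.regex.Pattern;

final class DotVsCharClass {
    // Hypothetical sample: an opening and a closing tag separated by a newline.
    static final String INPUT = "<a>\n</a>";

    // With default flags '.' does not match '\n', so .*? cannot bridge the two tags.
    static boolean dotFinds() {
        return Pattern.compile("<a>.*?</a>").matcher(INPUT).find();
    }

    // [\s\S] means "whitespace or non-whitespace", which includes '\n'.
    static boolean charClassFinds() {
        return Pattern.compile("<a>[\\s\\S]*?</a>").matcher(INPUT).find();
    }

    public static void main(String[] args) {
        System.out.println(".*?      finds a match: " + dotFinds());
        System.out.println("[\\s\\S]*? finds a match: " + charClassFinds());
    }
}
```

Laziness is not the issue here: both quantifiers are reluctant, but only the character class is allowed to consume the newline.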
According to the Oracle documentation, . matches "Any character (may or may not match line terminators)", while a line terminator is any of the following:
- A newline (line feed) character ('\n'),
- A carriage-return character followed immediately by a newline character ("\r\n"),
- A standalone carriage-return character ('\r'),
- A next-line character ('\u0085'),
- A line-separator character ('\u2028'), or
- A paragraph-separator character ('\u2029').
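The list above can be checked directly. The following is only a sketch with hypothetical class and method names; it tests a lone . against each single-character terminator using the default flags:

```java
import java.util.regex.Pattern;

final class LineTerminatorCheck {
    // The single-character line terminators from the list above.
    // The CR+LF pair is two characters, so a lone '.' could never match it anyway.
    static final String[] TERMINATORS = {"\n", "\r", "\u0085", "\u2028", "\u2029"};

    // True if a single '.' (default flags) matches the entire string.
    static boolean dotMatches(String s) {
        return Pattern.matches(".", s);
    }

    public static void main(String[] args) {
        for (String t : TERMINATORS) {
            // Print the code point rather than the raw, invisible character.
            System.out.printf("U+%04X matched by '.': %b%n", (int) t.charAt(0), dotMatches(t));
        }
        System.out.println("'a' matched by '.': " + dotMatches("a"));
    }
}
```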
The two expressions are not equivalent unless the necessary flags are set. Again quoting the Oracle documentation:
If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters. The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.
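As a rough illustration of the two flags quoted above (the helper method and the test strings are assumptions made for this sketch), Pattern.DOTALL or its inline form (?s) lets . match every character, while Pattern.UNIX_LINES only narrows the set of terminators to '\n':

```java
import java.util.regex.Pattern;

final class DotFlagsDemo {
    // True if the regex, compiled with the given flags, matches the whole input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Default: '.' refuses the newline between 'a' and 'b'.
        System.out.println(matches("a.b", 0, "a\nb"));
        // DOTALL: '.' matches any character, line terminators included.
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));
        // (?s) is the embedded-flag spelling of DOTALL.
        System.out.println(matches("(?s)a.b", 0, "a\nb"));
        // UNIX_LINES: a carriage return no longer counts as a line terminator...
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\rb"));
        // ...but '\n' still does, so this remains a non-match.
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\nb"));
    }
}
```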
Here is a sheet explaining all the regex commands.
Basically, [\s\S] will pick up all characters, including newlines, whereas . does not pick up line terminators by default (certain flags need to be set for it to pick them up).
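Applied to the expression from the question, one possible fix is to keep .*? and compile with Pattern.DOTALL. This is a sketch only: the sample document, class name and method name below are invented for illustration, and the Java escaping of the pattern is reconstructed from the unescaped form in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RdfBlockFinder {
    // The question's expression with .*? in place of [\s\S]*?, escaped for Java.
    static final String REGEX =
        "<\\?xml\\s+version=\"[\\d\\.]+\"\\s*\\?>\\s*<\\s*rdf:RDF[^>]*>.*?<\\s*\\/\\s*rdf:RDF\\s*>";

    // Hypothetical input: an RDF block embedded in surrounding text.
    static final String SAMPLE =
        "some text before\n"
        + "<?xml version=\"1.0\"?>\n"
        + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
        + "  <rdf:Description/>\n"
        + "</rdf:RDF>\n"
        + "some text after";

    // Returns the first matched block, or null if the pattern does not match.
    static String findBlock(String text, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println("default flags: " + findBlock(SAMPLE, 0));
        System.out.println("DOTALL:\n" + findBlock(SAMPLE, Pattern.DOTALL));
    }
}
```

Prefixing the pattern with (?s) would have the same effect as passing the flag, which is handy when the expression lives in a configuration file.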