 

What's the difference between [\s\S]*? and .*? in Java regular expressions?

Tags:

java

regex

xml

I have developed a regular expression to identify a block of XML inside a text file. The expression looks like this (I have removed all the Java escape backslashes to make it easier to read):

<\?xml\s+version="[\d\.]+"\s*\?>\s*<\s*rdf:RDF[^>]*>[\s\S]*?<\s*\/\s*rdf:RDF\s*>

Then I tried to optimise it by replacing [\s\S]*? with .*? and it suddenly stopped recognising the XML.

As far as I know, \s matches any whitespace character and \S matches any non-whitespace character (the same as [^\s]), so [\s\S] should logically be equivalent to . (dot). I didn't use greedy quantifiers, so what could be the difference?

Dmitry asked Feb 07 '16 02:02



2 Answers

The regular expressions . and [\s\S] are not equivalent, because . does not match line terminators (such as a newline) by default.

According to the Oracle documentation, . matches:

Any character (may or may not match line terminators)

while a line terminator is any of the following:

  • A newline (line feed) character ('\n'),
  • A carriage-return character followed immediately by a newline character ("\r\n"),
  • A standalone carriage-return character ('\r'),
  • A next-line character ('\u0085'),
  • A line-separator character ('\u2028'), or
  • A paragraph-separator character ('\u2029').
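To make the list above concrete, here is a minimal sketch that tests a lone . and [\s\S] against each single-character line terminator under the default flags. The class name DotLineTerminators and its helper methods are made up for this illustration; only java.util.regex.Pattern comes from the JDK.

```java
import java.util.regex.Pattern;

class DotLineTerminators {
    // True if a lone "." matches the entire input with default flags.
    static boolean dotMatches(String s) {
        return Pattern.matches(".", s);
    }

    // True if the character class [\s\S] matches the entire input.
    static boolean classMatches(String s) {
        return Pattern.matches("[\\s\\S]", s);
    }

    public static void main(String[] args) {
        // The single-character line terminators from the list above.
        String[] terminators = {"\n", "\r", "\u0085", "\u2028", "\u2029"};
        for (String t : terminators) {
            System.out.printf("U+%04X  dot=%b  class=%b%n",
                    (int) t.charAt(0), dotMatches(t), classMatches(t));
        }
        // An ordinary character for comparison.
        System.out.printf("U+%04X  dot=%b  class=%b%n",
                (int) 'a', dotMatches("a"), classMatches("a"));
    }
}
```

Each terminator should report dot=false and class=true, while the ordinary character 'a' is matched by both.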

The two expressions therefore behave differently unless the DOTALL flag is set. Again quoting the Oracle documentation:

If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.

The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.
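As a rough illustration of the flags quoted above, the sketch below runs the regular expression from the question against a small multi-line RDF block. The sample input, the class name RdfBlockDemo, and the HEAD/TAIL split of the pattern are assumptions made for this example; Pattern.DOTALL, Pattern.UNIX_LINES, and the embedded (?s) flag are part of java.util.regex.

```java
import java.util.regex.Pattern;

class RdfBlockDemo {
    // The question's pattern, split around the part that was changed.
    static final String HEAD =
            "<\\?xml\\s+version=\"[\\d\\.]+\"\\s*\\?>\\s*<\\s*rdf:RDF[^>]*>";
    static final String TAIL = "<\\s*\\/\\s*rdf:RDF\\s*>";

    // Hypothetical sample input: an RDF block that spans several lines.
    static final String SAMPLE =
            "<?xml version=\"1.0\"?>\n"
            + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
            + "  <rdf:Description/>\n"
            + "</rdf:RDF>";

    // True if HEAD + body + TAIL is found in SAMPLE with the given flags.
    static boolean finds(String body, int flags) {
        return Pattern.compile(HEAD + body + TAIL, flags).matcher(SAMPLE).find();
    }

    // With UNIX_LINES only the newline counts as a line terminator,
    // so "." is allowed to match a lone carriage return.
    static boolean dotMatchesCrUnderUnixLines() {
        return Pattern.compile(".", Pattern.UNIX_LINES).matcher("\r").matches();
    }

    public static void main(String[] args) {
        System.out.println("[\\s\\S]*?       : " + finds("[\\s\\S]*?", 0));
        System.out.println(".*? (default)  : " + finds(".*?", 0));
        System.out.println(".*? (DOTALL)   : " + finds(".*?", Pattern.DOTALL));
        System.out.println("(?s:.*?)       : " + finds("(?s:.*?)", 0));
        System.out.println("UNIX_LINES, CR : " + dotMatchesCrUnderUnixLines());
    }
}
```

Only the default-flag .*? variant should fail to find the block, because the content between the tags contains newlines. Passing Pattern.DOTALL, or scoping the flag with (?s:...), makes . behave like [\s\S] here.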

Neuron answered Oct 13 '22 01:10


Here is a sheet explaining all the regex commands.

Basically, [\s\S] will pick up all characters, including newlines, whereas . does not pick up line terminators by default (certain flags, such as DOTALL, need to be set for that).

z7r1k3 answered Oct 12 '22 23:10