I was trying to answer a regex question for someone and came across something that made me scratch my head. Given the following code...
    public static void main(String[] args) throws IOException {
        String test = "Hello, how are you today?";
        Pattern p = Pattern.compile("(\\W)+");
        String[] words = p.split(test);
        System.out.println("--" + words[0] + "--");
        System.out.println("--" + words[1] + "--");
    }
I get the expected results:
--Hello--
--how--
However, when I use * instead of + ...
    public static void main(String[] args) throws IOException {
        String test = "Hello, how are you today?";
        Pattern p = Pattern.compile("(\\W)*");
        String[] words = p.split(test);
        System.out.println("--" + words[0] + "--");
        System.out.println("--" + words[1] + "--");
    }
I get these results:
----
--H--
Is there a reason * doesn't work exactly like the + in this situation?
* matches zero or more occurrences, so the pattern can also match the empty string. As a result, every position between two characters becomes a (zero-width) delimiter.
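To see this for yourself, here is a minimal sketch that compares the two patterns side by side. The class name SplitComparison and the helper splitWith are made up for illustration; everything else is the standard java.util.regex API. One caveat: whether the * result starts with an empty string depends on your Java version, since Java 8 and later skip a zero-width match at the very beginning of the input, so the leading "----" from the question only shows up on older versions.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitComparison {
    // Hypothetical helper: compile the regex and split the input with it.
    static String[] splitWith(String regex, String input) {
        return Pattern.compile(regex).split(input);
    }

    public static void main(String[] args) {
        String test = "Hello, how are you today?";

        // + needs at least one non-word character, so it cannot match the empty string.
        System.out.println("".matches("(\\W)+")); // false
        // * accepts zero occurrences, so the empty string is a valid match.
        System.out.println("".matches("(\\W)*")); // true

        // With +, only runs of non-word characters act as delimiters.
        System.out.println(Arrays.toString(splitWith("(\\W)+", test)));
        // With *, every position is a delimiter, so each letter becomes its own element.
        System.out.println(Arrays.toString(splitWith("(\\W)*", test)));
    }
}
```

The third line printed is [Hello, how, are, you, today]; the fourth is the letter-by-letter list, with or without a leading empty element depending on the Java version.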
By the way, that doesn't mean it's matching non-greedily. If you look at all the elements the split returns, you get this:
[, H, e, l, l, o, , h, o, w, , a, r, e, , y, o, u, , t, o, d, a, y]
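Notice that there is only one empty element between "o" and "h", not two: the comma and the space are consumed together as a single delimiter, which is exactly what a greedy match does. Below, each delimiter is surrounded by {}.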
{}H{}e{}l{}l{}o{, }{}h{}o{}w{ }{}a{}r{}e{ }{}y{}o{}u{ }{}t{}o{}d{}a{}y{?}
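If you want to reproduce a picture like the one above, a rough sketch is to walk the matches with Matcher.find() and wrap each one in braces. The class DelimiterDemo and the method markDelimiters are names I made up for this; the approach just assumes that split and find() visit the same matches. Note that find() also reports one more empty match at the very end of the input, so this sketch prints a trailing {} that the picture above leaves out (split throws away the trailing empty string that match would produce).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterDemo {
    // Hypothetical helper: returns the input with every regex match wrapped in {}.
    static String markDelimiters(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder out = new StringBuilder();
        int prev = 0; // end of the previous match
        while (m.find()) {
            out.append(input, prev, m.start());            // text between two delimiters
            out.append('{').append(m.group()).append('}'); // the delimiter itself
            prev = m.end();
        }
        out.append(input.substring(prev)); // whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String test = "Hello, how are you today?";
        System.out.println(markDelimiters("(\\W)+", test));
        System.out.println(markDelimiters("(\\W)*", test));
    }
}
```

On a shorter input such as "ab, c", the + pattern gives ab{, }c while the * pattern gives {}a{}b{, }{}c{}, which makes the zero-width delimiters easy to spot.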
Because + means one or more occurrences of the preceding element, whereas * means zero or more occurrences, so * can also match the empty string.