I'm learning regular expressions and thought I was starting to get a grip, but then...
I tried to split a string, and I need help understanding something as simple as this:
String input = "abcde";
System.out.println("[a-z] - " + Arrays.toString(input.split("[a-z]")));
System.out.println("\\w - " + Arrays.toString(input.split("\\w")));
System.out.println("\\w*? - " + Arrays.toString(input.split("\\w*?")));
System.out.println("\\w+? - " + Arrays.toString(input.split("\\w+?")));
The output is
[a-z] - []
\w - []
\w*? - [, a, b, c, d, e]
\w+? - []
Why doesn't either of the first two lines split the String on every character? The third expression, \w*? (the question mark prevents greediness), works as I expected, splitting the String on every character. The fourth, \w+? (plus, one or more matches), returns an empty array.
I've tried the expression in Notepad++ and in a program, and it shows 5 matches, as in:
Scanner ls = new Scanner(input);
while (ls.hasNext())
    System.out.format("%s ", ls.findInLine("\\w"));
Output is: a b c d e
This really puzzles me
To split a string by a regular expression, pass a regex as a parameter to the split() method, e.g. str.split(/[,. \s]/). The split method takes a string or regular expression and splits the string based on the provided separator, into an array of substrings.
A succinct version: \\w+ matches one or more word characters (letters, digits, and _). \\W+ matches one or more characters that are not word characters. They are opposites.
You do not only have to use literal strings for splitting strings into an array with the split method. You can use regex as breakpoints that match more characters for splitting a string.
If you want to split a string on a regular expression (regex) match instead of an exact match, use split() from Python's re module. In re.split(), specify the regex pattern in the first parameter and the target string in the second parameter.
Splits an input string into an array of substrings at the positions defined by a specified regular expression pattern. Specified options modify the matching operation. Splits an input string a specified maximum number of times into an array of substrings, at the positions defined by a regular expression specified in the Regex constructor.
Regex to split String into words with multiple word boundary delimiters. In this example, we will use the [\b\W\b]+ regex pattern to cater to any non-alphanumeric delimiters. Using this pattern we can split a string by multiple word boundary delimiters, which results in a list of alphanumeric/word tokens.
Regex splits the string based on a pattern. It handles a delimiter specified as a pattern. This is why Regex is better than string.Split. Here are some examples of how to split a string using Regex in C#.
If (" ") is used as separator, the string is split between words. Optional. The character (or regular expression) to use for splitting. If omitted, an array with the original string is returned.
If you split a string with a regex, you essentially tell where the string should be cut. This necessarily cuts away what you match with the regex. Which means if you split at \w, then every character is a split point and the substrings between them (all empty) are returned. Java automatically removes trailing empty strings, as described in the documentation.
This also explains why the lazy match \w*? will give you every character, because it will match every position between (and before and after) any character (zero-width). What's left are the characters of the string themselves.
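To see the discarded empty strings for yourself, here is a minimal sketch that compares the default split with the two-argument overload, String.split(regex, limit). A negative limit keeps trailing empty strings. The class and method names (SplitLimitDemo, splitDefault, splitKeepingEmpties) are made up for illustration.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Default split (limit 0): trailing empty strings are discarded.
    static String[] splitDefault(String input, String regex) {
        return input.split(regex);
    }

    // Negative limit: every substring is kept, including trailing empty ones.
    static String[] splitKeepingEmpties(String input, String regex) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        String input = "abcde";
        // Five one-character matches leave six empty substrings around them.
        System.out.println(Arrays.toString(splitDefault(input, "\\w")));        // []
        System.out.println(Arrays.toString(splitKeepingEmpties(input, "\\w"))); // [, , , , , ]
    }
}
```

With the default limit, all six substrings are empty and trailing, so nothing survives. That is the empty array from the question.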
Let's break it down:
[a-z], \w, \w+?
Your string is
abcde
And the matches are as follows:
a b c d e
└─┘└─┘└─┘└─┘└─┘
which leaves you with the substrings between the matches, all of which are empty.
The above three regexes behave the same in this regard as they all will only match a single character. \w+? will do so because it lacks any other constraints that might make the +? try matching more than just the bare minimum (it's lazy, after all).
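A quick way to check this is to list the matches with a Matcher instead of splitting. The sketch below prints each match as start-end:text. The helper name matchSpans is hypothetical, and the greedy \w+ is only included for contrast.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpanDemo {
    // Returns every match as "start-end:text" so the consumed characters are visible.
    static List<String> matchSpans(String input, String regex) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return spans;
    }

    public static void main(String[] args) {
        // Each of these yields five one-character matches:
        // [0-1:a, 1-2:b, 2-3:c, 3-4:d, 4-5:e]
        for (String regex : new String[] {"[a-z]", "\\w", "\\w+?"}) {
            System.out.println(regex + " -> " + matchSpans("abcde", regex));
        }
        // For contrast, the greedy form consumes the whole string: [0-5:abcde]
        System.out.println("\\w+ -> " + matchSpans("abcde", "\\w+"));
    }
}
```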
\w*?
a b c d e
└┘ └┘ └┘ └┘ └┘ └┘
In this case the matches are between the characters, leaving you with the following substrings:
"", "a", "b", "c", "d", "e", ""
Java throws the trailing empty one away, though.
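The zero-width matches can be made visible the same way. This is a rough sketch with a made-up helper, matchStarts, that records where each match begins. It also splits with a negative limit so the trailing empty string is kept. One caveat: Java 8 and later additionally drop the leading empty string produced by a zero-width match at the start of the input, so on a current JVM the default split prints [a, b, c, d, e] rather than the output shown in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthSplitDemo {
    // Collects the start index of every match; for \w*? each match is empty.
    static List<Integer> matchStarts(String input, String regex) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String input = "abcde";
        // One empty match before, between, and after the characters.
        System.out.println(matchStarts(input, "\\w*?")); // [0, 1, 2, 3, 4, 5]
        // Default limit: the trailing empty string is removed.
        System.out.println(Arrays.toString(input.split("\\w*?")));
        // Negative limit: the trailing empty string is kept as the last element.
        System.out.println(Arrays.toString(input.split("\\w*?", -1)));
    }
}
```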