I'm making a text-based dice roller. It takes in strings like "2d10+5" and returns a string as the result of the roll(s). My problem shows up in the tokenizer, which splits the string into useful parts for me to parse into information.
String[] tokens = message.split("(?=[dk\\+\\-])");
This is yielding strange, unexpected results. I don't know exactly what is causing them. It could be the regex, my misunderstanding, or Java just being Java. Here's what's happening:
3d6+4 yields the string array [3, d6, +4]. This is correct.
d% yields the string array [d%]. This is correct.
d20 yields the string array [d20]. This is correct.
d%+3 yields the string array [, d%, +3]. This is incorrect.
d20+2 yields the string array [, d20, +2]. This is incorrect.
In the fourth and fifth examples, something strange is causing an extra empty string to appear at the front of the array. It's not the lack of a number at the front of the string, as the other examples disprove that. It's not the presence of the percent sign, nor the plus sign.
For now I'm just continuing through the for loop on blank strings, but that feels sorta like a band-aid solution. Does anyone have any idea what causes the blank string at the front of the array? How can I fix it?
Digging through the source code, I found the exact cause of this behaviour.
The String.split() method internally uses Pattern.split(). Before returning the resulting array, that method checks the index of the last match. If that index is still 0, your pattern either matched only an empty string at the beginning of the input or didn't match at all; in that case, the returned array is a single-element array containing the original string.
Here's the source code:
public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<String>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while (m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            // Consider this assignment. For a single empty string match
            // m.end() will be 0, and hence index will also be 0
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // The rest of the method is not needed for this explanation
If the last condition in the above code, index == 0, is true, then a single-element array containing the input string is returned.
Now, consider the cases in which index can be 0.
If a match is found at the beginning and the matched string has length 0, then the assignment index = m.end(); in the if block (inside the while loop) sets index to 0. The only match that can do this is an empty string (length 0), which is exactly the case here. There also must not be any further matches, or else index would be updated to a different value.
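To see where the matches fall, here is a minimal sketch that lists the start and end of every match for the question's pattern. The class name MatchEndTrace and the helper matchSpans are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchEndTrace {
    // Hypothetical helper: records "start-end" for every match of regex in input.
    static List<String> matchSpans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        // One zero-width match at index 0, so m.end() is 0 and never changes.
        System.out.println(matchSpans("(?=[dk+-])", "d%"));    // [0-0]
        // A second zero-width match before '+', so the last m.end() is 3.
        System.out.println(matchSpans("(?=[dk+-])", "d20+2")); // [0-0, 3-3]
    }
}
```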
So, considering your cases:
For d%, there is just a single match for the pattern, before the first d. Hence the index value is 0. Since there aren't any further matches, index is never updated, the if condition is true, and the single-element array with the original string is returned.
For d20+2, there are two matches: one before d and one before +. So index is updated, and the ArrayList in the above code is returned. It contains the empty string produced by splitting on a delimiter at the very start of the string, as already explained in @Stema's answer.
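The two cases can be reproduced with a small re-implementation of the quoted loop. This is only a sketch: legacySplit is a hypothetical helper that handles the limit == 0 case, and the tail of the method (adding the remaining segment and dropping trailing empty strings) is my assumption about the part not quoted above. It avoids calling String.split() directly because, as far as I know, Java 8 and later no longer produce a leading empty string for a zero-width match at index 0, so the result of the real method depends on the Java version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacySplitSketch {
    // Hypothetical helper modelled on the quoted loop (limit == 0 only).
    static String[] legacySplit(String regex, String input) {
        int index = 0;
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);

        // Add the segment before each match found.
        while (m.find()) {
            parts.add(input.substring(index, m.start()));
            index = m.end();
        }

        // index is still 0: no match, or only a zero-width match at index 0.
        if (index == 0) {
            return new String[] { input };
        }

        // Assumed tail: add the remaining segment, then drop trailing empties.
        parts.add(input.substring(index));
        int size = parts.size();
        while (size > 0 && parts.get(size - 1).isEmpty()) {
            size--;
        }
        return parts.subList(0, size).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(legacySplit("(?=[dk+-])", "d%")));    // [d%]
        System.out.println(Arrays.toString(legacySplit("(?=[dk+-])", "d20+2"))); // [, d20, +2]
        System.out.println(Arrays.toString(legacySplit("(?=[dk+-])", "3d6+4"))); // [3, d6, +4]
    }
}
```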
So, to get the behaviour you want (that is, splitting on the delimiter only when it is not at the beginning), you can add a negative look-behind to your regex pattern:
"(?<!^)(?=[dk+-])" // You don't need to escape + or the hyphen (when it's at the end)
This splits on an empty string that is followed by a character from your class but not preceded by the beginning of the string.
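A quick check of the look-behind pattern against the inputs from the question might look like the sketch below. The class name DiceTokenizerSketch and the tokenize method are placeholders, not anything from the original code.

```java
import java.util.Arrays;

public class DiceTokenizerSketch {
    // Hypothetical wrapper around the split call from the question,
    // using the pattern with the negative look-behind.
    static String[] tokenize(String message) {
        return message.split("(?<!^)(?=[dk+-])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("3d6+4"))); // [3, d6, +4]
        System.out.println(Arrays.toString(tokenize("d%")));    // [d%]
        System.out.println(Arrays.toString(tokenize("d%+3")));  // [d%, +3]
        System.out.println(Arrays.toString(tokenize("d20+2"))); // [d20, +2]
    }
}
```

Because the pattern can no longer match at index 0, the result does not depend on how split() treats a zero-width match at the start of the input.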
Consider the case of splitting the string "ad%" on the regex pattern "a(?=[dk+-])". This gives you an array whose first element is an empty string. The only change here is that the empty match is replaced with a:
"ad%".split("a(?=[dk+-])"); // Gives [, d%]
Why? Because the length of the matched string is 1, the index value after the first match, m.end(), is 1 rather than 0, and hence the single-element array isn't returned.