Java String.split() sometimes giving blank strings

Tags: java, string, regex, split

I'm making a text-based dice roller. It takes in strings like "2d10+5" and returns a string as a result of the roll(s). My problem is showing up in the tokenizer that splits the string into useful parts for me to parse into information.

String[] tokens = message.split("(?=[dk\\+\\-])");

This is yielding strange, unexpected results. I don't know exactly what is causing them. It could be the regex, my misunderstanding, or Java just being Java. Here's what's happening:

3d6+4 yields the string array [3, d6, +4]. This is correct.
d% yields the string array [d%]. This is correct.
d20 yields the string array [d20]. This is correct.
d%+3 yields the string array [, d%, +3]. This is incorrect.
d20+2 yields the string array [, d20, +2]. This is incorrect.

In the fourth and fifth examples, something strange is causing an extra empty string to appear at the front of the array. It's not the lack of a number at the front of the string, as other examples disprove that. It's not the presence of the percentage sign, nor the plus sign.

For now I'm just continuing through the for loop on blank strings, but that feels sorta like a band-aid solution. Does anyone have any idea what causes the blank string at the front of the array? How can I fix it?

asked Sep 18 '13 11:09

Corey Noel

1 Answer

Digging through the source code, I found the exact cause of this behaviour.

The String.split() method internally uses Pattern.split(). Before returning the resulting array, that method checks the index where the last match ended. If that index is 0, then either the pattern did not match at all, or its only match was an empty string at the very beginning of the input. In both cases the returned array has a single element: the original input string.

Here's the source code:

public String[] split(CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        ArrayList<String> matchList = new ArrayList<String>();
        Matcher m = matcher(input);

        // Add segments before each match found
        while(m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                String match = input.subSequence(index, m.start()).toString();
                matchList.add(match);

                // Consider this assignment. For a single empty string match
                // m.end() will be 0, and hence index will also be 0
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                String match = input.subSequence(index,
                                                 input.length()).toString();
                matchList.add(match);
                index = m.end();
            }
        }

        // If no match was found, return this
        if (index == 0)
            return new String[] {input.toString()};

        // The rest of the method is not relevant here

If that last condition, index == 0, is true, a single-element array containing the input string is returned.

Now, consider the cases when the index can be 0.

1. When there is no match at all (as the comment above that condition already says).
2. When the match is found at the beginning and the matched string has length 0. Then the assignment in the if block (inside the while loop)
```
index = m.end();
```
leaves index at 0. Only an empty match at position 0 can end at 0, which is exactly the case here. There must also be no further matches, otherwise index would be updated to a later position.
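
To see where index ends up, it can help to list m.end() for every match the pattern finds. This is only a minimal sketch: the class name MatchPositions and the helper matchEnds are made up for illustration and are not part of the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchPositions {
    // Hypothetical helper: collects the end offset of every match of regex in input.
    static List<Integer> matchEnds(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> ends = new ArrayList<>();
        while (m.find()) {
            ends.add(m.end());
        }
        return ends;
    }

    public static void main(String[] args) {
        String regex = "(?=[dk+-])";
        // One zero-width match at offset 0, so index would stay 0.
        System.out.println("d%    -> " + matchEnds(regex, "d%"));    // [0]
        // A second match at offset 3 moves index away from 0.
        System.out.println("d20+2 -> " + matchEnds(regex, "d20+2")); // [0, 3]
    }
}
```

The last value in each list is what index holds when the loop finishes, which is the value the index == 0 check looks at.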

So, considering your cases:

For d%, there is just a single match for the pattern, before the first d, so index is 0. Since there are no further matches, index is never updated, the if condition is true, and the single-element array with the original string is returned.
For d20+2 there are two matches, one before d and one before +. So index is updated past 0, and the ArrayList built in the above code is returned. It starts with an empty string because the first match sits at the very first character of the string, as already explained in @Stema's answer.
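
The five inputs from the question can be traced with a small re-implementation of the quoted loop. This is a sketch under assumptions: legacySplit is a hypothetical helper that only covers the limit == 0 case, and the part after the index == 0 check (adding the remaining segment and dropping trailing empty strings) is my reconstruction of the omitted code. I reimplement the loop instead of calling String.split() because, as far as I know, the documentation from Java 8 onwards says a zero-width match at the beginning never produces an empty leading substring, so a current JDK may not reproduce the output in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LegacySplitTrace {
    // Hypothetical helper mirroring the quoted loop for limit == 0 only.
    static String[] legacySplit(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> parts = new ArrayList<>();
        int index = 0;

        // Add the segment before each match found.
        while (m.find()) {
            parts.add(input.substring(index, m.start()));
            index = m.end();
        }

        // No match, or only a zero-width match at offset 0.
        if (index == 0) {
            return new String[] { input };
        }

        // Assumed remainder: add the last segment, drop trailing empty strings.
        parts.add(input.substring(index));
        int size = parts.size();
        while (size > 0 && parts.get(size - 1).isEmpty()) {
            size--;
        }
        return parts.subList(0, size).toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String s : new String[] { "3d6+4", "d%", "d20", "d%+3", "d20+2" }) {
            System.out.println(s + " -> " + Arrays.toString(legacySplit("(?=[dk+-])", s)));
        }
    }
}
```

With this loop, d% and d20 come back as single-element arrays, while d%+3 and d20+2 pick up the leading empty string, matching the results listed in the question.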

So, to get the behaviour you want (that is, split on the delimiter only when it is not at the beginning), you can add a negative look-behind to your regex pattern:

"(?<!^)(?=[dk+-])"  // You don't need to escape + or a hyphen (when it is at the end) inside a character class

This splits on an empty string that is followed by one of the characters in your character class, but not preceded by the beginning of the string.
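
A quick way to check the look-behind version is to run it over the inputs from the question. This is a minimal sketch; the class name DiceTokenizer and the method tokenize are illustrative names, not code from the question.

```java
import java.util.Arrays;

class DiceTokenizer {
    // Splits before d, k, + or -, except at the very start of the input.
    static String[] tokenize(String message) {
        return message.split("(?<!^)(?=[dk+-])");
    }

    public static void main(String[] args) {
        for (String s : new String[] { "3d6+4", "d%", "d20", "d%+3", "d20+2" }) {
            System.out.println(s + " -> " + Arrays.toString(tokenize(s)));
        }
    }
}
```

Because the pattern can no longer match at offset 0, d%+3 and d20+2 should come back as [d%, +3] and [d20, +2], with no empty first element.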

Consider the case of splitting the string "ad%" on the regex pattern "a(?=[dk+-])". This gives you an array whose first element is an empty string. The only change here is that the match is now the character a instead of an empty string:

"ad%".split("a(?=[dk+-])");  // Yields [, d%]

Why? Because the length of the matched string is 1. So the index value after the first match, m.end(), is 1 rather than 0, and hence the single-element array is not returned.

answered Sep 22 '22 18:09

Rohit Jain
