
Why does string.split with a regular expression that contains a capturing group return an array that ends with an empty string?


I'd like to split an input string on the first colon that still has characters after it on the same line.

For this, I am using the regular expression /:(.+)/

So given the string

aaa:
bbb:ccc

I'd expect an output of

["aaa:\nbbb", "ccc"] 

And given the string

aaa:bbb:ccc 

I'd expect an output of

["aaa", "bbb:ccc"] 

Yet when I actually run these commands, I get

["aaa:\nbbb", "ccc", ""]
["aaa", "bbb:ccc", ""]

as output.

So somehow, JavaScript is adding an empty string to the end of the array.

I have checked the documentation for String.split. It does mention that if you call split on an empty string with a specified separator, you get an array containing one empty string (and not an empty array). But it makes no mention of there always being an empty string in the output, nor does it warn that you may get this result if you make some common mistake.

I'd understand if my input string had a colon at the end or something like that; then it splits at the colon and what remains after it is an empty string. That's the issue mentioned in "Splitting string with regular expression to make it array without empty element", but I don't have this issue, as my input string does not end with my separator.

I know a quick solution in my case is to simply limit the number of results, via "aaa:bbb:ccc".split(/:(.+)/, 2), but I'm still curious:

Why does this string.split call return an array ending with an empty string?

Pimgd asked Jul 08 '16 07:07




2 Answers

If we change the regex to /:.+/ and split aaa:bbb:ccc on it, we get:

["aaa", ""] 

This makes sense, as the regex matches :bbb:ccc. You get the same output if you split on that literal string manually:

>>> 'aaa:bbb:ccc'.split(':bbb:ccc')
['aaa', '']

Adding the capture group just makes split include the captured bbb:ccc in the result; it doesn't change where the string is split.
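To see the two behaviours side by side, here is a minimal sketch using the string from the question. The variable names are only illustrative.

```javascript
const input = 'aaa:bbb:ccc';

// Without a capturing group the matched separator ":bbb:ccc" is dropped,
// leaving the piece before it and the (empty) piece after it.
const withoutGroup = input.split(/:.+/);
console.log(withoutGroup); // [ 'aaa', '' ]

// With a capturing group the split points are the same, but the captured
// text is inserted between the two pieces.
const withGroup = input.split(/:(.+)/);
console.log(withGroup); // [ 'aaa', 'bbb:ccc', '' ]
```

In both cases the last element is the empty remainder after the separator; the capture group only adds an element in front of it.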

Peilonrayz answered Sep 17 '22 22:09


Interesting. Learnt a lot from this question. Let me share what I learnt.

Dot doesn't match the new line

If we think about it, the intention is to split the string on a : followed by one or more characters. If that is the case, the output should have been

['aaa', '\nbbb:ccc', ''] 

right? Because the .+ matches greedily. So, it should have split at :\nbbb:ccc, where : matches : and .+ matches \nbbb:ccc. But the actual output you got was

[ 'aaa:\nbbb', 'ccc', '' ] 

This is because . does not match line terminators. Quoting MDN:

(The dot, the decimal point) matches any single character except line terminators: \n, \r, \u2028 or \u2029.

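So :\n doesn't match :(.+), because the first colon is followed by a newline rather than by a character the dot can match. That is why the string is not split there. If you actually meant to match the newline as well, use [^] or [\s\S] instead of the dot.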

For example,

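var data = 'aaa:\nbbb:ccc';

// The dot does not match \n, so the split happens at the second colon
console.log(data.split(/:(.+)/));
// [ 'aaa:\nbbb', 'ccc', '' ]

// [\s\S] and [^] match any character, including \n
console.log(data.split(/:([\s\S]+)/));
// [ 'aaa', '\nbbb:ccc', '' ]
console.log(data.split(/:([^]+)/));
// [ 'aaa', '\nbbb:ccc', '' ]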
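As an aside that is not part of the original answer: engines that support ES2018 also accept the s (dotAll) flag, which makes the dot itself match line terminators. A small sketch, assuming such an engine is available:

```javascript
const text = 'aaa:\nbbb:ccc';

// With the s flag, "." also matches \n, so the first colon starts the match.
const parts = text.split(/:(.+)/s);
console.log(parts); // [ 'aaa', '\nbbb:ccc', '' ]
```

Older engines reject the s flag with a syntax error, so [\s\S] remains the more portable choice.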

Now to answer your actual question: why is there an empty string at the end of the result? When you cut a line once, how many pieces do you get? Two. So every cut produces two pieces. In your case, aaa:\nbbb is the piece before the cut, the cut itself happens at :ccc, and since the string ends right there, the piece after the cut is an empty string. The captured ccc is inserted between those two pieces, which is why the array has three elements.
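The cut can be reproduced by hand, which may make the empty string easier to see. This is only a rough sketch for a regex that matches once; it is not how split is actually implemented, and the variable names are made up for illustration.

```javascript
const str = 'aaa:\nbbb:ccc';
const match = /:(.+)/.exec(str); // matches ":ccc" at index 8

const before = str.slice(0, match.index);               // piece before the cut
const captured = match[1];                              // text saved by the group
const after = str.slice(match.index + match[0].length); // piece after the cut

const manual = [before, captured, after];
console.log(manual);              // [ 'aaa:\nbbb', 'ccc', '' ]
console.log(str.split(/:(.+)/));  // [ 'aaa:\nbbb', 'ccc', '' ]

// The limit argument from the question simply truncates the result.
const limited = str.split(/:(.+)/, 2);
console.log(limited);             // [ 'aaa:\nbbb', 'ccc' ]
```

The third element is the remainder after the match, and it is empty because the match runs to the end of the string.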

thefourtheye answered Sep 20 '22 22:09