
Is Python's str.split() inconsistent?

>>> ".a string".split('.')
['', 'a string']

>>> "a .string".split('.')
['a ', 'string']

>>> "a string.".split('.')
['a string', '']

>>> "a ... string".split('.')
['a ', '', '', ' string']

>>> "a ..string".split('.')
['a ', '', 'string']

>>> 'this  is a test'.split(' ')
['this', '', 'is', 'a', 'test']

>>> 'this  is a test'.split()
['this', 'is', 'a', 'test']

Why does split() behave differently from split(' ') when the only whitespace in the string is spaces?

Why does split('.') turn the "..." in "a ... string" into two empty strings, ['', '']? split() does not produce an empty word between two separators...

The docs are clear about this (see @agf's answer below), but I'd like to know why this behaviour was chosen.

I have looked at the source code (here) and thought the comparison on line 136 should be a plain less-than: ...i < str_len...

asked by ijverig, Oct 19 '25 09:10



1 Answer

See the str.split docs, where this behavior is specifically described:

If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']). The sep argument may consist of multiple characters (for example, '1<>2<>3'.split('<>') returns ['1', '2', '3']). Splitting an empty string with a specified separator returns [''].

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
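The two rules quoted above can be checked side by side. This is only a small sketch; the sample strings are made up for illustration and are not taken from the docs.

```python
# Illustrative sample strings (assumptions, not from the documentation).
dotted = "a ... string"
spaced = "  this  is a test  "

# Explicit separator: every occurrence delimits a field, so N separators
# always give N + 1 items, some of which may be empty strings.
with_sep = dotted.split(".")
print(with_sep)                      # ['a ', '', '', ' string']
assert len(with_sep) == dotted.count(".") + 1

# Because nothing is discarded, joining with the separator restores the input.
assert ".".join(with_sep) == dotted

# No separator (or None): runs of whitespace collapse and the ends are trimmed.
no_sep = spaced.split()
print(no_sep)                        # ['this', 'is', 'a', 'test']

# The same string with an explicit ' ' keeps one item per gap between spaces.
space_sep = spaced.split(" ")
print(space_sep)                     # ['', '', 'this', '', 'is', 'a', 'test', '', '']

# Edge cases mentioned in the docs.
empty_with_sep = "".split(".")       # ['']
blank_no_sep = "   ".split()         # []
```

Seen this way, the three dots in "a ... string" are three separators, which gives four fields, two of them empty.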

Python tries to do what you would expect. Without thinking too hard about it, most people would probably expect

'1 2 3 4 '.split() 

to return

['1', '2', '3', '4']

Think about splitting data where spaces have been used instead of tabs to create fixed-width columns -- if the values have different widths, there will be a different number of spaces between them in each row.

There is often trailing whitespace at the end of a line that you can't see, and the default ignores it as well -- it gives you the answer you'd visually expect.

When it comes to the algorithm used when a delimiter is specified, think about a row in a CSV file:

1,,3

means there is data in the first and third columns and none in the second, so you would want

'1,,3'.split(',')

to return

['1', '', '3']

otherwise you wouldn't be able to tell what column each string came from.
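A minimal sketch of why the empty string matters for column positions. The header names here are hypothetical, chosen only to give the columns labels:

```python
# Hypothetical header; only the row '1,,3' comes from the text above.
header = "id,name,score"
row = "1,,3"

columns = header.split(",")
values = row.split(",")              # ['1', '', '3']

# Each value keeps its position, so it pairs with the right column.
record = dict(zip(columns, values))
print(record)                        # {'id': '1', 'name': '', 'score': '3'}

# If empty fields were dropped, '3' would slide into the 'name' column.
collapsed = [v for v in values if v]
misaligned = dict(zip(columns, collapsed))
print(misaligned)                    # {'id': '1', 'name': '3'}
```

For real CSV files with quoted fields, the standard library's csv module is usually the safer choice; plain split(',') is used here only to show the positional argument.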

answered by agf, Oct 21 '25 22:10



