When splitting an empty string in Python, why does split() return an empty list while split('\n') returns ['']?

Question: I am using split('\n') to get lines in one string, and found that ''.split() returns an empty list, [], while ''.split('\n') returns [''].

The str.split() method has two algorithms. If no arguments are given, it splits on repeated runs of whitespace. However, if an argument is given, it is treated as a single delimiter with no repeated runs.

In the case of splitting an empty string, the first mode (no argument) will return an empty list because the whitespace is eaten and there are no values to put in the result list.

In contrast, the second mode (with an argument such as \n) always produces at least one field, so an empty string yields a single empty field. Had you written '\n'.split('\n'), you would get two fields (one split gives you two halves).
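As a quick check of the two modes described above, here is a minimal sketch; the variable names are only illustrative.

```python
# Whitespace mode (no separator): empty fields are never produced.
empty_default = ''.split()
blank_default = ' \t\n '.split()

# Separator mode: there is always one more field than there are separators.
empty_newline = ''.split('\n')
single_newline = '\n'.split('\n')
trailing_newline = 'a\nb\n'.split('\n')

print(empty_default, blank_default)      # [] []
print(empty_newline, single_newline)     # [''] ['', '']
print(trailing_newline)                  # ['a', 'b', '']
```

The trailing empty field in the last result is why split('\n') reports one extra "line" for text that ends with a newline, whereas 'a\nb\n'.splitlines() gives ['a', 'b'].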

Question: Is there any specific reason for such a difference?

This first mode is useful when data is aligned in columns with variable amounts of whitespace. For example:

>>> data = '''\
Shasta      California     14,200
McKinley    Alaska         20,300
Fuji        Japan          12,400
'''
>>> for line in data.splitlines():
        print(line.split())

['Shasta', 'California', '14,200']
['McKinley', 'Alaska', '20,300']
['Fuji', 'Japan', '12,400']

The second mode is useful for delimited data such as CSV where repeated commas denote empty fields. For example:

>>> data = '''\
Guido,BDFL,,Amsterdam
Barry,FLUFL,,USA
Tim,,,USA
'''
>>> for line in data.splitlines():
        print(line.split(','))

['Guido', 'BDFL', '', 'Amsterdam']
['Barry', 'FLUFL', '', 'USA']
['Tim', '', '', 'USA']
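One way to see why the empty fields matter is to pair each field with a column name. The sketch below assumes hypothetical column names (they are not part of the original data) and shows that positions stay aligned only when empty fields are kept. For real CSV files with quoting, the standard csv module is usually a better fit than split(',').

```python
# Hypothetical column names, chosen only for this illustration.
columns = ['name', 'title', 'middle', 'location']

data = 'Guido,BDFL,,Amsterdam\nBarry,FLUFL,,USA\nTim,,,USA\n'

# Keeping empty fields preserves each value's column position.
records = [dict(zip(columns, line.split(','))) for line in data.splitlines()]
print(records[2]['location'])   # USA

# Dropping empty fields would shift 'USA' into the second column.
shifted = [field for field in 'Tim,,,USA'.split(',') if field]
print(shifted)                  # ['Tim', 'USA']
```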

Note that the number of result fields is one greater than the number of delimiters. Think of cutting a rope: if you make no cuts, you have one piece; one cut gives two pieces; two cuts give three pieces. And so it is with Python's str.split(delimiter) method:

>>> ''.split(',')       # No cuts
['']
>>> ','.split(',')      # One cut
['', '']
>>> ',,'.split(',')     # Two cuts
['', '', '']
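The rope rule can be spot-checked with a small helper. This is only a sketch: the function name is made up for illustration, and the samples use a single-character delimiter.

```python
def one_more_field_than_cuts(text, delimiter=','):
    """Return True if split() yields exactly count(delimiter) + 1 fields."""
    return len(text.split(delimiter)) == text.count(delimiter) + 1

samples = ['', ',', ',,', 'a,b', 'Tim,,,USA', 'no delimiter here']
results = [one_more_field_than_cuts(sample) for sample in samples]
print(all(results))   # True
```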

Question: And is there any more convenient way to count lines in a string?

Yes, there are a couple of easy ways. One uses str.count() and the other uses str.splitlines(). Both ways will give the same answer unless the final line is missing the \n. If the final newline is missing, the str.splitlines approach will give the accurate answer. A faster technique that is also accurate uses the count method but then corrects it for the final newline:

>>> data = '''\
Line 1
Line 2
Line 3
Line 4'''

>>> data.count('\n')                               # Inaccurate
3
>>> len(data.splitlines())                         # Accurate, but slow
4
>>> data.count('\n') + (not data.endswith('\n'))   # Accurate and fast
4
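The counting technique above can be wrapped in a small helper. This is a sketch under two assumptions: the function name is hypothetical, and an empty string is treated as having zero lines (the bare expression would report 1 for it). It counts only '\n', whereas str.splitlines() also recognizes other line boundaries such as '\r'.

```python
def count_lines(text):
    """Count lines terminated by '\\n', including an unterminated final line."""
    if not text:
        return 0  # assumption: an empty string has no lines
    # A bool is an int in Python, so this adds 1 when the final newline is missing.
    return text.count('\n') + (not text.endswith('\n'))

data = 'Line 1\nLine 2\nLine 3\nLine 4'
print(count_lines(data))          # 4
print(count_lines(data + '\n'))   # 4
```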

Question from @Kaz: Why the heck are two very different algorithms shoe-horned into a single function?

The signature for str.split is about 20 years old, and a number of the APIs from that era are strictly pragmatic. While not perfect, the method signature isn't "terrible" either. For the most part, Guido's API design choices have stood the test of time.

The current API is not without advantages. Consider strings such as:

ps_aux_header  = 'USER               PID  %CPU %MEM      VSZ'
patient_header = 'name,age,height,weight'

When asked to break these strings into fields, people tend to describe both using the same English word, "split". When asked to read code such as fields = line.split() or fields = line.split(','), people tend to correctly interpret the statements as "splits a line into fields".

Microsoft Excel's text-to-columns tool made a similar API choice and incorporates both splitting algorithms in the same tool. People seem to mentally model field-splitting as a single concept even though more than one algorithm is involved.
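For the two header strings above, the corresponding calls look like this (the result variable names are just for illustration):

```python
ps_aux_header  = 'USER               PID  %CPU %MEM      VSZ'
patient_header = 'name,age,height,weight'

# Whitespace mode for column-aligned text, separator mode for delimited text.
ps_fields = ps_aux_header.split()
patient_fields = patient_header.split(',')

print(ps_fields)        # ['USER', 'PID', '%CPU', '%MEM', 'VSZ']
print(patient_fields)   # ['name', 'age', 'height', 'weight']
```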

It seems to simply be the way it's supposed to work, according to the documentation:

Splitting an empty string with a specified separator returns [''].

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
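The documented behavior for leading and trailing whitespace can be checked directly. In this minimal sketch, passing None explicitly selects the whitespace algorithm, while passing ' ' selects the single-delimiter algorithm:

```python
padded = '  a b  '

# sep=None: runs of whitespace collapse, no empty strings at either end.
whitespace_fields = padded.split(None)   # same as padded.split()

# sep=' ': every single space is a cut, so empty fields appear.
space_fields = padded.split(' ')

print(whitespace_fields)   # ['a', 'b']
print(space_fields)        # ['', '', 'a', 'b', '', '']
```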

So, to make it clearer, the split() method implements two different splitting algorithms and chooses between them based on whether a separator is given (and is not None). This might be because it allows optimizing the no-separator case more than the one with a separator; I don't know.

.split() without parameters tries to be clever. It splits on any whitespace (tabs, spaces, line feeds, etc.) and, as a result, it also drops all empty strings.

>>> "  fii    fbar \n bopp ".split()
['fii', 'fbar', 'bopp']

Essentially, .split() without parameters is used to extract words from a string, as opposed to .split() with parameters, which simply splits the string at every occurrence of the separator.

That's the reason for the difference.

And yeah, counting lines by splitting is not an efficient way. Count the number of line feeds, and add one if the string doesn't end with a line feed.

Use count():

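s = "Line 1\nLine2\nLine3"
n_lines = s.count('\n') + 1   # adding 1 is correct only because s does not end with '\n'

>>> print(str.split.__doc__)   # output below is from Python 2; the Python 3 docstring is worded differently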
S.split([sep [,maxsplit]]) -> list of strings

Return a list of the words in the string S, using sep as the
delimiter string.  If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.

Note the last sentence.

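To count lines you can simply count how many \n there are:

line_count = some_string.count('\n') + (some_string[-1] != '\n')

The last term accounts for a final line that does not end with \n; the parentheses are required because + binds more tightly than !=. This means that Hello, World! and Hello, World!\n have the same line count (which seems reasonable to me). Note that some_string[-1] raises IndexError on an empty string, so guard against that case or use not some_string.endswith('\n') instead. Otherwise, you can simply add 1 to the count of \n.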
