
Which characters are considered whitespace by split()?

I am porting some Python 2 code that calls split() on strings, so I need to know its exact behavior. The documentation states that when you do not specify the sep argument, "runs of consecutive whitespace are regarded as a single separator".
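As a small illustration of the documented behavior quoted above (a sketch written for Python 3; the sample strings are made up for the example):

```python
text = "  alpha \t beta\n\ngamma  "

# No sep: runs of whitespace collapse into a single separator, and
# leading/trailing whitespace does not produce empty strings.
default_parts = text.split()
print(default_parts)

# Explicit sep: every single occurrence splits, so an empty string
# appears between the two adjacent spaces.
explicit_parts = "a  b".split(" ")
print(explicit_parts)
```

This shows how the result is shaped, but not which characters count as whitespace, which is the actual question.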

Unfortunately, it does not specify which characters that would be. There are some obvious contenders (like space, tab, and newline), but Unicode contains plenty of other candidates.

Which characters are considered to be whitespace by split()?

Since the answer might be implementation-specific, I'm targeting CPython.

(Note: I researched the answer to this myself since I couldn't find it anywhere, so I'll be posting it here, hopefully for the benefit of others.)

Aasmund Eldhuset asked May 02 '20 21:05



2 Answers

Unfortunately, it depends on whether your string is a str or a unicode object (at least in CPython; I don't know whether this behavior is mandated by a specification anywhere).

If it is a str, the answer is straightforward:

  • 0x09 Tab
  • 0x0a Newline
  • 0x0b Vertical Tab
  • 0x0c Form Feed
  • 0x0d Carriage Return
  • 0x20 Space

Source: these are the characters with PY_CTF_SPACE in Python/pyctype.c, which are used by Py_ISSPACE, which is used by STRINGLIB_ISSPACE, which is used by split_whitespace.

If it is a unicode, there are 29 characters, which in addition to the above are:

  • U+001c through U+001f: File/Group/Record/Unit Separator
  • U+0085: Next Line
  • U+00a0: Non-Breaking Space
  • U+1680: Ogham Space Mark
  • U+2000 through U+200a: various fixed-size spaces (e.g. Em Space), but note that Zero-Width Space (U+200b) is not included
  • U+2028: Line Separator
  • U+2029: Paragraph Separator
  • U+202f: Narrow No-Break Space
  • U+205f: Medium Mathematical Space
  • U+3000: Ideographic Space

Note that the first four of these (U+001c through U+001f) are also valid ASCII characters, which means that an ASCII-only string might split differently depending on whether it is a str or a unicode!
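The difference can be reproduced on Python 3 under one assumption: that bytes stands in for Python 2's str and str stands in for Python 2's unicode. Python 3 documents bytes.split() as splitting on ASCII whitespace only, so a rough sketch looks like this (the sample data is made up):

```python
# Assumption: Python 3, with bytes playing the role of Python 2's str
# and str playing the role of Python 2's unicode.
sample_bytes = b"a\x1cb\x1fc d"            # ASCII-only data with two separators
sample_text = sample_bytes.decode("ascii")

bytes_parts = sample_bytes.split()   # only the six ASCII whitespace bytes split
text_parts = sample_text.split()     # the Unicode whitespace rules apply

print(bytes_parts)
print(text_parts)
```

The bytes version keeps the File Separator and Unit Separator inside the first token, while the text version splits on them.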

Source: these are the characters listed in _PyUnicode_IsWhitespace, which is used by Py_UNICODE_ISSPACE, which is used by STRINGLIB_ISSPACE (it looks like they use the same function implementations for both str and unicode, but compile it separately for each type, with certain macros implemented differently). The docstring describes this set of characters as follows:

Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the category 'Zs'
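That description can be checked against the unicodedata module. The following is only a sketch for Python 3: the helper name whitespace_by_properties is made up for illustration, and it assumes that unicodedata reflects the same Unicode database as the string methods of the interpreter running it.

```python
import sys
import unicodedata


def whitespace_by_properties():
    """Code points with bidirectional class WS, B or S, or category Zs."""
    found = []
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        if (unicodedata.bidirectional(ch) in ("WS", "B", "S")
                or unicodedata.category(ch) == "Zs"):
            found.append(code_point)
    return found


by_properties = whitespace_by_properties()
by_isspace = [cp for cp in range(sys.maxunicode + 1) if chr(cp).isspace()]

# Report whether the property-based set agrees with str.isspace().
print(len(by_properties), len(by_isspace))
print("same set:", by_properties == by_isspace)
```

If the two lists differ on a given interpreter, the mismatching code points would be the interesting part to inspect.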

Aasmund Eldhuset answered Oct 12 '22 17:10


The answer by Aasmund Eldhuset is what I was attempting to do but I was beaten to the punch. It shows a lot of research and should definitely be the accepted answer.

If you want confirmation of that answer (or just want to test it in a different implementation, such as a non-CPython one, or a later one which may use a different version of the Unicode standard under the covers), the following short program will print out the actual characters that cause a split when using .split() with no arguments. It needs Python 3.6 or later, since it uses f-strings.

It does this by constructing a string with the a and b characters (see footnote (a)) separated by the character being tested, then detecting whether split() returns a list with more than one element:

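int_ch = 0
while True:
    try:
        # chr() raises ValueError once int_ch passes the last code point.
        test_str = "a" + chr(int_ch) + "b"
    except ValueError as e:
        print(f'Stopping, {e}')
        break
    # More than one element means the tested character acted as a separator.
    if len(test_str.split()) != 1:
        print(f'0x{int_ch:06x} ({int_ch})')
    int_ch += 1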

The output (for my system) is as follows:

0x000009 (9)
0x00000a (10)
0x00000b (11)
0x00000c (12)
0x00000d (13)
0x00001c (28)
0x00001d (29)
0x00001e (30)
0x00001f (31)
0x000020 (32)
0x000085 (133)
0x0000a0 (160)
0x001680 (5760)
0x002000 (8192)
0x002001 (8193)
0x002002 (8194)
0x002003 (8195)
0x002004 (8196)
0x002005 (8197)
0x002006 (8198)
0x002007 (8199)
0x002008 (8200)
0x002009 (8201)
0x00200a (8202)
0x002028 (8232)
0x002029 (8233)
0x00202f (8239)
0x00205f (8287)
0x003000 (12288)
Stopping, chr() arg not in range(0x110000)

You can ignore the error at the end. It is only there to confirm that the loop doesn't fail until we've moved out of the valid Unicode range (code points 0x000000 through 0x10ffff, which make up the seventeen planes).
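To match those code points against character names, a variation on the same idea can look each one up with unicodedata.name. This is a sketch for Python 3; the helper name splitting_code_points and the fallback label are made up for illustration.

```python
import sys
import unicodedata


def splitting_code_points():
    """Every code point that str.split() with no arguments treats as a separator."""
    return [cp for cp in range(sys.maxunicode + 1)
            if len(("a" + chr(cp) + "b").split()) != 1]


separators = splitting_code_points()
for cp in separators:
    # Control characters have no formal Unicode name, so pass a fallback
    # label instead of letting unicodedata.name raise ValueError.
    name = unicodedata.name(chr(cp), "<unnamed control character>")
    print(f"U+{cp:04X}  {name}")
```

Using range(sys.maxunicode + 1) avoids relying on the exception from chr() to end the loop.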


(a) I'm hoping that no future version of Python ever considers a or b to be whitespace, as that would totally break this (and a lot of other) code.

I think the chances of that are rather slim, so it should be fine :-)

paxdiablo answered Oct 12 '22 15:10