I noticed that Python's standard string method splitlines() actually removes some crucial Unicode control characters as well. Example:
>>> s1 = u'asdf \n fdsa \x1d asdf'
>>> s1.splitlines()
[u'asdf ', u' fdsa ', u' asdf']
Notice how the "\x1d" character quietly disappears.
It doesn't happen if the string is a Python bytestring, though (without the "u" prefix):
>>> s2 = 'asdf \n fdsa \x1d asdf'
>>> s2.splitlines()
['asdf ', ' fdsa \x1d asdf']
I can't find any information about this in the reference https://docs.python.org/2.7/library/stdtypes.html#str.splitlines.
Why does this happen? Which characters other than "\x1d" (or unichr(29)) are affected?
I'm using Python 2.7.3 on Ubuntu 12.04 LTS.
The splitlines() method splits a string at line boundaries and returns a list of the lines in the string.
It takes one optional parameter, keepends: when set to True, the line breaks are included in the resulting list; by default they are stripped.
Characters such as \n and \r trigger a split; other control characters such as \0, \b and \t are not line boundaries and are left in the string.
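To make the effect of keepends concrete, here is a small sketch. The sample text and variable names are made up for illustration; the u prefix is kept so the snippet reads the same on Python 2.7 and Python 3.3+.

```python
# Hypothetical sample text with two different line endings.
text = u'first\nsecond\r\nthird'

# Default: the line breaks are stripped from each element.
without_ends = text.splitlines()

# keepends=True: each element keeps its own line break.
with_ends = text.splitlines(True)

print(without_ends)
print(with_ends)
```

Note that "\r\n" is treated as a single line break, not as two.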
This is indeed under-documented; I had to dig through the source code somewhat to find it.
The unicodetype_db.h file defines linebreaks as:
case 0x000A:
case 0x000B:
case 0x000C:
case 0x000D:
case 0x001C:
case 0x001D:
case 0x001E:
case 0x0085:
case 0x2028:
case 0x2029:
These are generated from the Unicode database; any codepoint listed in the Unicode standard with the Line_Break property set to BK, CR, LF or NL, or with bidirectional category set to B (paragraph break), is considered a line break.
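If you would rather check this empirically than read the C source, a brute-force probe works. This is only a sketch, written for Python 3 (where every str is Unicode); on 2.7 you would have to use unichr() instead of chr(), and a narrow build cannot probe beyond U+FFFF this way. The function name splitlines_boundaries is my own.

```python
import sys

def splitlines_boundaries(limit=sys.maxunicode + 1):
    """Return the code points that str.splitlines() treats as a line boundary."""
    found = []
    for cp in range(limit):
        # Put the candidate between two ordinary characters; if it is a
        # line boundary, splitlines() returns two pieces instead of one.
        probe = u'a' + chr(cp) + u'b'
        if len(probe.splitlines()) == 2:
            found.append(cp)
    return found

boundaries = splitlines_boundaries()
print([u'U+%04X' % cp for cp in boundaries])
```

On a current CPython 3 this should report the same ten code points as the case list above.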
From the Unicode Data file, version 6 of the standard lists U+001D as a paragraph break:
001D;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR THREE;;;;
(5th column is the bidirectional category).
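The same fields can be read from Python through the standard unicodedata module, which saves a trip to the data file. A minimal sketch follows; the helper name bidi_info is made up for this example.

```python
import unicodedata

def bidi_info(ch):
    """Return (general category, bidirectional category) for one character."""
    return unicodedata.category(ch), unicodedata.bidirectional(ch)

# The three information separators that splitlines() splits on,
# plus U+001F for comparison.
for ch in (u'\x1c', u'\x1d', u'\x1e', u'\x1f'):
    print(u'U+%04X' % ord(ch), bidi_info(ch))
```

U+001D comes back with bidirectional category B, matching the data file entry above. U+001F has category S (segment separator) rather than B, which is consistent with it not being in the linebreak list.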
You could use a regular expression if you want to limit what characters to split on:
import re
linebreaks = re.compile(ur'[\n-\r\x85\u2028\u2029]')
linebreaks.split(yourtext)
This would split your text on the same set of linebreaks, except for the U+001C, U+001D and U+001E codepoints, that is, the three data-structuring control characters.
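Here is a version of that idea that also runs on Python 3, where the ur'' prefix is no longer valid syntax. Treat it as a sketch: the function name split_lines_keep_separators is hypothetical, and the "\r\n" alternative is my own addition so that a Windows line ending counts as one break rather than two.

```python
import re

# \n-\r covers U+000A through U+000D; the information separators
# U+001C, U+001D and U+001E are deliberately left out.
LINEBREAKS = re.compile(u'\r\n|[\n-\r\x85\u2028\u2029]')

def split_lines_keep_separators(text):
    """Split on line breaks but leave U+001C..U+001E inside the lines."""
    return LINEBREAKS.split(text)

s1 = u'asdf \n fdsa \x1d asdf'
print(split_lines_keep_separators(s1))
```

One difference from splitlines() to keep in mind: re.split() returns a trailing empty string when the text ends with a line break, so you may want to strip that off.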