Python string splitlines() removes certain Unicode control characters

Tags: python, unicode

I noticed that Python's standard string method splitlines() also splits on, and therefore removes, some crucial Unicode control characters. Example:

>>> s1 = u'asdf \n fdsa \x1d asdf'
>>> s1.splitlines()
[u'asdf ', u' fdsa ', u' asdf']

Notice how the "\x1d" character quietly disappears.

It doesn't happen if the string is a Python byte string (without the "u" prefix):

>>> s2 = 'asdf \n fdsa \x1d asdf'
>>> s2.splitlines()
['asdf ', ' fdsa \x1d asdf']

I can't find any information about this in the reference https://docs.python.org/2.7/library/stdtypes.html#str.splitlines.

Why does this happen? What other characters than "\x1d" (or unichr(29)) are affected?

I'm using Python 2.7.3 on Ubuntu 12.04 LTS.

Niklas9 asked Jun 27 '14 14:06



1 Answer

This is indeed under-documented; I had to dig through the source code somewhat to find it.

The unicodetype_db.h file defines linebreaks as:

case 0x000A:
case 0x000B:
case 0x000C:
case 0x000D:
case 0x001C:
case 0x001D:
case 0x001E:
case 0x0085:
case 0x2028:
case 0x2029:

These are generated from the Unicode database; any codepoint listed in the Unicode standard with the Line_Break property set to BK, CR, LF or NL or with bidirectional category set to B (paragraph break) is considered a line break.
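To check which code points your own interpreter treats as line breaks, you can simply try every one of them. The following is a minimal sketch, not taken from the CPython source; the helper name splitlines_breaks is made up for illustration, and the unichr fallback is only there so the same code runs on both Python 2 and Python 3.

```python
import sys

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3 has no unichr(); chr() does the same job


def splitlines_breaks():
    """Return every code point that splitlines() treats as a line break.

    Hypothetical helper: it brute-forces the whole code point range that
    this interpreter build supports.
    """
    found = []
    for cp in range(sys.maxunicode + 1):
        # If the character is a line break, 'a' and 'b' end up in
        # separate list items.
        if len((u'a' + unichr(cp) + u'b').splitlines()) == 2:
            found.append(cp)
    return found


breaks = splitlines_breaks()
print([u'U+%04X' % cp for cp in breaks])
```

On a current Python 3 this should report the same ten code points as the case list above; on a narrow Python 2 build only the Basic Multilingual Plane is scanned, which is enough because all of them lie below U+FFFF.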

From the Unicode Data file, version 6 of the standard lists U+001D as a paragraph break:

001D;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR THREE;;;;

(5th column is the bidirectional category).
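You don't have to read the data file by hand to see this; the standard unicodedata module exposes the same field. A small sketch, assuming only that you want to inspect the three information separators (the bidi_classes name is just for illustration):

```python
import unicodedata

# Look up the bidirectional category of the three information separators.
bidi_classes = {}
for ch in (u'\x1c', u'\x1d', u'\x1e'):
    bidi_classes[ch] = unicodedata.bidirectional(ch)
    print(u'U+%04X %s %s' % (ord(ch), unicodedata.name(ch, u'<control>'),
                             bidi_classes[ch]))
```

Each of the three should come back with category B, which is why splitlines() on a unicode string treats them as paragraph breaks.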

You could use a regular expression if you want to limit what characters to split on:

import re

# \n-\r covers U+000A through U+000D; U+2029 is the paragraph separator
linebreaks = re.compile(ur'[\n-\r\x85\u2028\u2029]')
linebreaks.split(yourtext)

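would split your text on the same set of line breaks except for U+001C, U+001D and U+001E, the three information separator control characters (file, group and record separator). Note that, unlike splitlines(), a plain regex split treats "\r\n" as two separate breaks and leaves a trailing empty string when the text ends with a line break.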
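If you need something closer to splitlines() itself, here is a rough sketch of a wrapper around the same idea. The function name split_text_lines is hypothetical, and the pattern is written without the ur prefix (which is a syntax error on Python 3) so that it runs on both Python 2 and Python 3. It is only an approximation of splitlines() under those assumptions, not a drop-in replacement (there is no keepends support, for instance).

```python
import re

# Same characters as the regex above, written out one by one, plus an
# explicit '\r\n' alternative so that a Windows line ending counts as
# one break. U+001C, U+001D and U+001E are deliberately left out.
_LINEBREAKS = re.compile(u'\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]')


def split_text_lines(text):
    """Split text on line breaks, keeping the information separators."""
    parts = _LINEBREAKS.split(text)
    # splitlines() does not report an empty final line, so drop it here too.
    if parts and parts[-1] == u'':
        parts.pop()
    return parts


print(split_text_lines(u'asdf \n fdsa \x1d asdf'))
```

With the string from the question, the "\x1d" character stays inside the second item instead of starting a third one.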

Martijn Pieters answered Oct 27 '22 01:10