Python string splitlines() removes certain Unicode control characters

I noticed that Python's standard string method splitlines() also removes some crucial Unicode control characters. Example:

>>> s1 = u'asdf \n fdsa \x1d asdf'
>>> s1.splitlines()
[u'asdf ', u' fdsa ', u' asdf']

Notice how the "\x1d" character quietly disappears.

This doesn't happen if the string is a Python byte string (without the "u" prefix):

>>> s2 = 'asdf \n fdsa \x1d asdf'
>>> s2.splitlines()
['asdf ', ' fdsa \x1d asdf']

I can't find any information about this in the reference https://docs.python.org/2.7/library/stdtypes.html#str.splitlines.

Why does this happen? Which characters other than "\x1d" (or unichr(29)) are affected?

I'm using Python 2.7.3 on Ubuntu 12.04 LTS.

This is indeed under-documented; I had to dig through the source code somewhat to find it.

The unicodetype_db.h file defines linebreaks as:

case 0x000A:
case 0x000B:
case 0x000C:
case 0x000D:
case 0x001C:
case 0x001D:
case 0x001E:
case 0x0085:
case 0x2028:
case 0x2029:

These are generated from the Unicode database; any codepoint listed in the Unicode standard with the Line_Break property set to BK, CR, LF or NL or with bidirectional category set to B (paragraph break) is considered a line break.
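The list above can also be checked empirically. The following is a minimal sketch, assuming Python 3 (where str is the Unicode string type, like unicode in Python 2.7); the helper name find_linebreak_codepoints is made up for illustration. It probes every code point and records the ones that splitlines() treats as a line boundary:

```python
import sys

def find_linebreak_codepoints():
    """Return every code point that str.splitlines() treats as a line break."""
    found = []
    for cp in range(sys.maxunicode + 1):
        # If chr(cp) is a line boundary, 'a<cp>b' splits into two pieces.
        if len(('a' + chr(cp) + 'b').splitlines()) == 2:
            found.append(cp)
    return found

if __name__ == '__main__':
    print(['U+%04X' % cp for cp in find_linebreak_codepoints()])
```

On CPython 3 this should report the same ten code points as the case list from unicodetype_db.h. The loop visits more than a million code points, so expect it to take a moment.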

From the Unicode Data file, version 6 of the standard lists U+001D as a paragraph break:

001D;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR THREE;;;;

(The 5th column is the bidirectional category.)
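The same field can be read from Python's bundled copy of the Unicode database through the unicodedata module. This is a small sketch, again assuming Python 3; the helper name paragraph_separators is hypothetical, and the bundled database may be a newer Unicode version than the one quoted above:

```python
import unicodedata

def paragraph_separators(codepoints):
    """Keep only the code points whose bidirectional category is 'B'."""
    return [cp for cp in codepoints
            if unicodedata.bidirectional(chr(cp)) == 'B']

if __name__ == '__main__':
    # Print the bidirectional category of the four information separators.
    for cp in (0x1C, 0x1D, 0x1E, 0x1F):
        print('U+%04X' % cp, unicodedata.bidirectional(chr(cp)))
```

U+001C, U+001D and U+001E come back as paragraph separators, while U+001F (INFORMATION SEPARATOR ONE) does not, which matches its absence from the line break list.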

You could use a regular expression if you want to limit what characters to split on:

import re

# \n-\r covers U+000A through U+000D; U+001C, U+001D and U+001E are left out
linebreaks = re.compile(ur'[\n-\r\x85\u2028\u2029]')
linebreaks.split(yourtext)

This splits your text on the same set of line breaks, except for U+001C, U+001D and U+001E, the three information separator control characters.
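For Python 3, where the ur'' prefix no longer exists, the same idea might look like the sketch below. The function name split_lines_keep_separators is made up for illustration, and matching "\r\n" as a single break is an addition to the pattern above, made so that Windows line endings do not produce empty strings:

```python
import re

# Same set as str.splitlines(), minus U+001C, U+001D and U+001E.
# "\r\n" is tried first so that it counts as one line break, not two.
LINEBREAKS = re.compile("\r\n|[\n-\r\x85\u2028\u2029]")

def split_lines_keep_separators(text):
    """Split text on line breaks, leaving the information separators alone."""
    return LINEBREAKS.split(text)

if __name__ == '__main__':
    print(split_lines_keep_separators('asdf \n fdsa \x1d asdf'))
```

One difference from splitlines() remains: a trailing line break yields a final empty string here, because re.split() reports the empty piece after the last separator.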

Tags: python, unicode

Asked by Niklas9. 1 answer, by Martijn Pieters.
