What's the most pythonic way of normalizing lineends in a string?

Given a text string of unknown origin, what is the best way to rewrite it so that it uses a known line-ending convention?

I usually do:

lines = text.splitlines()
text = '\n'.join(lines)

... but this doesn't handle "mixed" text-files of utterly confused conventions (Yes, they still exist!).

Edit

The one-liner for what I'm doing is, of course:

'\n'.join(text.splitlines())

... that's not what I'm asking about.

The total number of lines should be the same afterwards, so no stripping of empty lines.

Testcases

Splitting

'a\nb\n\nc\nd'
'a\r\nb\r\n\r\nc\r\nd'
'a\rb\r\rc\rd'
'a\rb\n\rc\rd'
'a\rb\r\nc\nd'
'a\nb\r\nc\rd'

... should all result in 5 lines. In a mixed context, splitlines() treats '\r\n' as a single logical newline, which yields 4 lines for the last two test cases.
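To make the counts concrete, here is a small check that runs str.splitlines() over the six test cases above and records how many lines it reports for each. The names `testcases` and `line_counts` are only for illustration.

```python
# Test cases copied from the question.
testcases = [
    'a\nb\n\nc\nd',
    'a\r\nb\r\n\r\nc\r\nd',
    'a\rb\r\rc\rd',
    'a\rb\n\rc\rd',
    'a\rb\r\nc\nd',
    'a\nb\r\nc\rd',
]

# Number of lines str.splitlines() reports for each test case.
line_counts = [len(case.splitlines()) for case in testcases]
print(line_counts)  # [5, 5, 5, 5, 4, 4]
```

The first four cases come out as 5 lines; the last two come out as 4 because each contains a '\r\n' pair that is counted as one line break.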

Hm, a mixed context that contains '\r\n' can be detected by comparing the result of splitlines() and split('\n'), and/or split('\r')...
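One possible way to detect such a mixed context is sketched below. Note that it counts the separators directly instead of comparing the results of splitlines() and split(), which is a swapped-in but closely related technique. The helper names `newline_styles` and `is_mixed` are hypothetical.

```python
def newline_styles(text):
    """Return the set of line-ending styles present in text (hypothetical helper)."""
    crlf = text.count('\r\n')
    # Subtract the CRLF pairs so that only lone CRs and lone LFs remain.
    lone_cr = text.count('\r') - crlf
    lone_lf = text.count('\n') - crlf

    styles = set()
    if crlf:
        styles.add('\r\n')
    if lone_cr:
        styles.add('\r')
    if lone_lf:
        styles.add('\n')
    return styles


def is_mixed(text):
    """True if text uses more than one line-ending convention."""
    return len(newline_styles(text)) > 1


print(is_mixed('a\r\nb\r\n\r\nc\r\nd'))  # False: CRLF only
print(is_mixed('a\rb\r\nc\nd'))          # True: CR, CRLF and LF
```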

751

asked Nov 17 '09 15:11

kaleissin

2 Answers

mixed.replace('\r\n', '\n').replace('\r', '\n')

should handle all possible variants.
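As a minimal sketch, the same chain of replace() calls can be wrapped in a function. The name `normalize_newlines` and its optional `newline` parameter are assumptions added for illustration, not part of the answer.

```python
def normalize_newlines(text, newline='\n'):
    """Rewrite CRLF and lone CR line endings to one convention (hypothetical helper)."""
    # Replace CRLF first; otherwise its CR and LF would each become
    # a separate line break and one newline would turn into two.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # Optionally convert to a target convention other than LF.
    if newline != '\n':
        text = text.replace('\n', newline)
    return text


print(repr(normalize_newlines('a\rb\r\nc\nd')))  # 'a\nb\nc\nd'
print(repr(normalize_newlines('a\r\n')))         # 'a\n'
```

Unlike '\n'.join(text.splitlines()), this keeps a trailing newline. It also leaves alone the other separators that str.splitlines() recognizes in Python 3, such as '\x0b', '\x0c' or '\u2028'.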

195

answered Sep 28 '22 01:09

dottedmag

... but this doesn't handle "mixed" text-files of utterly confused conventions (Yes, they still exist!)

Actually it should work fine:

>>> s = 'hello world\nline 1\r\nline 2'

>>> s.splitlines()
['hello world', 'line 1', 'line 2']

>>> '\n'.join(s.splitlines())
'hello world\nline 1\nline 2'

What version of Python are you using?

EDIT: I still don't see how splitlines() is not working for you:

>>> s = '''\
... First line, with LF\n\
... Second line, with CR\r\
... Third line, with CRLF\r\n\
... Two blank lines with LFs\n\
... \n\
... \n\
... Two blank lines with CRs\r\
... \r\
... \r\
... Two blank lines with CRLFs\r\n\
... \r\n\
... \r\n\
... Three blank lines with a jumble of things:\r\n\
... \r\
... \r\n\
... \n\
... End without a newline.'''

>>> s.splitlines()
['First line, with LF', 'Second line, with CR', 'Third line, with CRLF', 'Two blank lines with LFs', '', '', 'Two blank lines with CRs', '', '', 'Two blank lines with CRLFs', '', '', 'Three blank lines with a jumble of things:', '', '', '', 'End without a newline.']

>>> print('\n'.join(s.splitlines()))
First line, with LF
Second line, with CR
Third line, with CRLF
Two blank lines with LFs


Two blank lines with CRs


Two blank lines with CRLFs


Three blank lines with a jumble of things:



End without a newline.

As far as I know splitlines() doesn't split the list twice or anything.

Can you paste a sample of the kind of input that's giving you trouble?

answered Sep 28 '22 00:09

Steve Losh


Tags:

python

newline

line-breaks