Most efficient way to delete needless newlines in Python

Question

I'm looking to find out how to use Python to get rid of needless newlines in text like what you get from Project Gutenberg, where their plain-text files are formatted with newlines every 70 characters or so. In Tcl, I could do a simple string map, like this:

set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]

This would keep paragraphs separated by two newlines (or a newline and a tab) separate, but run together the lines that ended with a single newline (substituting a space), and drop superfluous CR's. Since Python doesn't have string map, I haven't yet been able to find out the most efficient way to dump all the needless newlines, although I'm pretty sure it's not just to search for each newline in order and replace it with a space. I could just evaluate the Tcl expression in Python, if all else fails, but I'd like to find out the best Pythonic way to do the same thing. Can some Python connoisseur here help me out?

ekhumoro · Accepted Answer

The nearest equivalent to the Tcl string map would be str.translate, but unfortunately it can only map single characters. So it would be necessary to use a regexp to get a similarly compact example. This can be done with look-behind/look-ahead assertions, but the \r's have to be replaced first:

import re

oldtext = """\
This would keep paragraphs separated.
This would keep paragraphs separated.

This would keep paragraphs separated.
\tThis would keep paragraphs separated.

\rWhen, in the course
of human events,
it becomes necessary
\rfor one people
"""

newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))

output:

This would keep paragraphs separated. This would keep paragraphs separated.

This would keep paragraphs separated.
    This would keep paragraphs separated.

When, in the course of human events, it becomes necessary for one people

I doubt whether this is as efficient as the tcl code, though.
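
For readers who want something closer to the Tcl call itself, here is a minimal sketch of a multi-character string map built on re.sub. The names string_map and unwrap_map are hypothetical, not part of the standard library. It assumes Python 3.7+ so that dict insertion order is preserved, and it relies on regex alternation trying the alternatives left to right, so earlier pairs take priority as they do in the Tcl list.

```python
import re

def string_map(mapping, text):
    """Replace each key of `mapping` with its value in one left-to-right pass.

    Hypothetical helper, not a standard-library function. At each position
    the keys are tried in insertion order, so earlier pairs win.
    """
    pattern = re.compile("|".join(re.escape(key) for key in mapping))
    return pattern.sub(lambda match: mapping[match.group(0)], text)

# Same pairs as the Tcl call: the two-character keys come before "\n"
# so that paragraph breaks are matched first and left unchanged.
unwrap_map = {
    "\r": "",
    "\n\n": "\n\n",
    "\n\t": "\n\t",
    "\n": " ",
}

sample = "one\r\ntwo\n\nthree\n\tfour\nfive"
result = string_map(unwrap_map, sample)
print(repr(result))  # 'one two\n\nthree\n\tfour five'
```

Mapping the paragraph breaks to themselves looks redundant, but it is what stops the final "\n" pair from ever seeing them.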

UPDATE:

I did a little test using this Project Gutenberg EBook of War and Peace (Plain Text UTF-8, 3.1 MB). Here's my tcl script:

set fp [open "gutenberg.txt" r]
set oldtext [read $fp]
close $fp

set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]

puts $newtext

and my python equivalent:

import re

with open('gutenberg.txt') as stream:
    oldtext = stream.read()

    newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))

    print(newtext)

Crude performance test:

$ /usr/bin/time -f '%E' tclsh gutenberg.tcl > output1.txt
0:00.18
$ /usr/bin/time -f '%E' python gutenberg.py > output2.txt
0:00.30

So, as expected, the tcl version is more efficient. However, the output from the python version seems somewhat cleaner (no extra spaces inserted at the beginning of lines).
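
The timings above include interpreter start-up and file I/O. To time only the substitution, a rough sketch with timeit might look like the following. The synthetic text is an assumption standing in for a real book, and the names UNWRAP and unwrap are illustrative. Absolute numbers will vary by machine, so none are shown.

```python
import re
import timeit

# Compiled once up front; the pattern is the one used in the script above.
UNWRAP = re.compile(r"(?<!\n)\n(?![\n\t])")

def unwrap(text):
    """Drop carriage returns, then join lines that end in a single newline."""
    return UNWRAP.sub(" ", text.replace("\r", ""))

# Synthetic stand-in for a wrapped book: 5,000 three-line paragraphs.
paragraph = "line one\nline two\nline three\n\n"
synthetic = paragraph * 5000

# Average wall-clock time of one pass, taken over five runs.
seconds = timeit.timeit(lambda: unwrap(synthetic), number=5) / 5
print(f"{seconds:.4f} s per pass over {len(synthetic)} characters")
```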

zondo · Answer

You can use a regular expression with a look-ahead search:

import re

text = """
...
"""

newtext = re.sub(r"\n(?=[^\n\t])", " ", text)

That will replace any new line that is not followed by a newline or a tab with a space.
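
One difference from the look-behind version is worth seeing on a small input. This is a sketch on a made-up sample string, and the variable names are illustrative. With a look-ahead only, the second newline of a blank-line paragraph break is followed by ordinary text, so it becomes a space at the start of the next paragraph.

```python
import re

sample = "first line\nsecond line\n\nnew paragraph\n"

# Look-ahead only: a newline is replaced whenever the next character
# is something other than a newline or a tab.
lookahead_only = re.sub(r"\n(?=[^\n\t])", " ", sample)

# Look-behind plus look-ahead: a newline that directly follows another
# newline is also left alone, so the paragraph break survives intact.
both = re.sub(r"(?<!\n)\n(?![\n\t])", " ", sample)

print(repr(lookahead_only))  # 'first line second line\n new paragraph\n'
print(repr(both))            # 'first line second line\n\nnew paragraph '
```

The two also treat the final newline differently: the positive look-ahead needs a following character and so keeps it, while the negative look-ahead succeeds at the end of the string and turns it into a space.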

TigerhawkT3 · Answer

I use the following script when I want to do this:

import sys
import os

filename, extension = os.path.splitext(sys.argv[1])

with open(filename+extension, encoding='utf-8-sig') as (file
  ), open(filename+"_unwrapped"+extension, 'w', encoding='utf-8-sig') as (output
  ):
    *lines, last = list(file)
    for line in lines:
        if line == "\n":
            line = "\n\n"
        elif line[0] == "\t":
            line = "\n" + line[:-1] + " "
        else:
            line = line[:-1] + " "
        else:
            line = line[:-1] + " "
        output.write(line)
    output.write(last)

A "blank" line, with only a linefeed, turns into two linefeeds (to replace the one removed from the previous line). This handles files that separate paragraphs with two linefeeds.
A line beginning with a tab gets a leading linefeed (to replace the one removed from the previous line) and gets its trailing linefeed replaced with a space. This handles files that separate paragraphs with a tab character.
A line that is neither blank nor beginning with a tab gets its trailing linefeed replaced with a space.
The last line in the file may not have a trailing linefeed and therefore gets copied directly.
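
To try these rules without touching files, here is a sketch that applies them to an in-memory string. The function name unwrap_lines is hypothetical, and the sketch assumes the text uses "\n" line endings only.

```python
import io

def unwrap_lines(text):
    """Apply the line-by-line rules described above to a string.

    Hypothetical helper; assumes "\n" is the only line terminator.
    """
    # io.StringIO yields lines with their trailing "\n", like a text file.
    lines = list(io.StringIO(text))
    if not lines:
        return ""
    *body, last = lines
    pieces = []
    for line in body:
        if line == "\n":
            # Blank line: restore the linefeed taken from the previous line.
            pieces.append("\n\n")
        elif line.startswith("\t"):
            # Tab-indented paragraph start: keep it on its own line.
            pieces.append("\n" + line[:-1] + " ")
        else:
            # Ordinary wrapped line: swap the linefeed for a space.
            pieces.append(line[:-1] + " ")
    # The last line may lack a trailing linefeed, so it is copied as-is.
    pieces.append(last)
    return "".join(pieces)

sample = "alpha\nbeta\n\ngamma\n\tdelta\nend"
print(repr(unwrap_lines(sample)))  # 'alpha beta \n\ngamma \n\tdelta end'
```

Note that the line before each paragraph break keeps a trailing space, since its linefeed has already been replaced by the time the break is seen.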
