 

Most efficient way to delete needless newlines in Python

Tags:

python

I'm looking to find out how to use Python to get rid of needless newlines in text like what you get from Project Gutenberg, where their plain-text files are formatted with newlines every 70 characters or so. In Tcl, I could do a simple string map, like this:

set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]

This would keep paragraphs separated by two newlines (or a newline and a tab) separate, but run together the lines that ended with a single newline (substituting a space), and drop superfluous CR's. Since Python doesn't have string map, I haven't yet been able to find out the most efficient way to dump all the needless newlines, although I'm pretty sure it's not just to search for each newline in order and replace it with a space. I could just evaluate the Tcl expression in Python, if all else fails, but I'd like to find out the best Pythonic way to do the same thing. Can some Python connoisseur here help me out?

McClamrock asked Mar 26 '16 23:03



3 Answers

The nearest equivalent to the Tcl string map would be str.translate, but unfortunately it can only map from single characters, not from multi-character sequences such as \n\n. So it is necessary to use a regexp to get a similarly compact example. This can be done with look-behind/look-ahead assertions, but the \r's have to be replaced first:

import re

oldtext = """\
This would keep paragraphs separated.
This would keep paragraphs separated.

This would keep paragraphs separated.
\tThis would keep paragraphs separated.

\rWhen, in the course
of human events,
it becomes necessary
\rfor one people
"""

newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))

output:

This would keep paragraphs separated. This would keep paragraphs separated.

This would keep paragraphs separated.
    This would keep paragraphs separated.

When, in the course of human events, it becomes necessary for one people

I doubt whether this is as efficient as the Tcl code, though.

UPDATE:

I did a little test using this Project Gutenberg EBook of War and Peace (Plain Text UTF-8, 3.1 MB). Here's my Tcl script:

set fp [open "gutenberg.txt" r]
set oldtext [read $fp]
close $fp

set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]

puts $newtext

and my Python equivalent:

import re

with open('gutenberg.txt') as stream:
    oldtext = stream.read()

newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))

print(newtext)

Crude performance test:

$ /usr/bin/time -f '%E' tclsh gutenberg.tcl > output1.txt
0:00.18
$ /usr/bin/time -f '%E' python gutenberg.py > output2.txt
0:00.30

So, as expected, the Tcl version is faster in this crude test (the timings cover the whole process, including interpreter start-up and file I/O). However, the output from the Python version seems somewhat cleaner (no extra spaces inserted at the beginning of lines).
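
One likely source of those extra leading spaces is a run of three or more newlines: the Tcl map consumes the first two as a paragraph break and then turns the leftover newline into a space. A minimal sketch of a string map emulation in Python makes this visible. The names `string_map` and `UNWRAP` are hypothetical and not part of either script above, and the ordered alternation is only assumed to mirror how Tcl picks the first matching key at each position.

```python
import re

def string_map(mapping, text):
    """Apply all replacements in a single left-to-right pass over text."""
    # Alternatives are tried in dict insertion order at each position,
    # so a longer key must come before any key that is its prefix.
    pattern = re.compile("|".join(re.escape(key) for key in mapping))
    return pattern.sub(lambda match: mapping[match.group(0)], text)

# Same pairs as the Tcl map: drop CR, keep paragraph breaks, join the rest.
UNWRAP = {"\r": "", "\n\n": "\n\n", "\n\t": "\n\t", "\n": " "}

sample = "one\r\ntwo\n\n\nthree"
mapped = string_map(UNWRAP, sample)
lookaround = re.sub(r"(?<!\n)\n(?![\n\t])", " ", sample.replace("\r", ""))

print(repr(mapped))      # 'one two\n\n three'
print(repr(lookaround))  # 'one two\n\n\nthree'
```

The look-behind in the regexp version refuses to touch any newline that directly follows another newline, which is why the third newline survives there.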

ekhumoro answered Sep 27 '22 16:09



You can use a regular expression with a look-ahead assertion:

import re

text = """
...
"""

newtext = re.sub(r"\n(?=[^\n\t])", " ", text)

That replaces with a space any newline that is followed by a character other than a newline or a tab.
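
A short check on a made-up sample (the `sample` string is an assumption for illustration) suggests two edge cases worth knowing. In a blank-line paragraph break the second newline is followed by ordinary text, so it also becomes a space, and a newline at the very end of the text is left alone because the look-ahead needs a following character. The `stricter` variant, which adds a look-behind, is one possible adjustment rather than part of the answer.

```python
import re

sample = "first line\nsecond line\n\nnew paragraph\n\tindented\n"

# Pattern from the answer: newline followed by a non-newline, non-tab char.
result = re.sub(r"\n(?=[^\n\t])", " ", sample)
print(repr(result))
# 'first line second line\n new paragraph\n\tindented\n'

# Variant that also skips a newline directly preceded by another newline.
stricter = re.sub(r"(?<!\n)\n(?=[^\n\t])", " ", sample)
print(repr(stricter))
# 'first line second line\n\nnew paragraph\n\tindented\n'
```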

zondo answered Sep 27 '22 15:09



I use the following script when I want to do this:

import sys
import os

filename, extension = os.path.splitext(sys.argv[1])

with open(filename + extension, encoding='utf-8-sig') as file, \
     open(filename + "_unwrapped" + extension, 'w', encoding='utf-8-sig') as output:
    *lines, last = list(file)
    for line in lines:
        if line == "\n":
            line = "\n\n"
        elif line[0] == "\t":
            line = "\n" + line[:-1] + " "
        else:
            line = line[:-1] + " "
        output.write(line)
    output.write(last)
  • A "blank" line, with only a linefeed, turns into two linefeeds (to replace the one removed from the previous line). This handles files that separate paragraphs with two linefeeds.
  • A line beginning with a tab gets a leading linefeed (to replace the one removed from the previous line) and gets its trailing linefeed replaced with a space. This handles files that separate paragraphs with a tab character.
  • A line that is neither blank nor beginning with a tab gets its trailing linefeed replaced with a space.
  • The last line in the file may not have a trailing linefeed and therefore gets copied directly.
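
The four rules above can also be sketched as a function over an in-memory list of lines, which makes them easy to try without touching files. `unwrap_lines` is a hypothetical helper, not part of the script; unlike the script, it assumes that empty input should yield an empty string rather than an error.

```python
def unwrap_lines(lines):
    """Join hard-wrapped lines, keeping blank-line and tab paragraph breaks."""
    if not lines:
        return ""  # assumption: empty input yields empty output
    *body, last = lines
    pieces = []
    for line in body:
        if line == "\n":
            # Blank line: restore the linefeed stripped from the previous line.
            pieces.append("\n\n")
        elif line.startswith("\t"):
            # Tab-indented paragraph start: begin on a fresh line.
            pieces.append("\n" + line[:-1] + " ")
        else:
            # Ordinary wrapped line: trade the linefeed for a space.
            pieces.append(line[:-1] + " ")
    pieces.append(last)  # the last line is copied unchanged
    return "".join(pieces)

sample = "When, in the course\nof human events,\n\nit becomes necessary\n"
unwrapped = unwrap_lines(sample.splitlines(keepends=True))
print(repr(unwrapped))
# 'When, in the course of human events, \n\nit becomes necessary\n'
```

Note the space left before the paragraph break: each joined line gets a trailing space even when the next line is blank.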
TigerhawkT3 answered Sep 27 '22 15:09
