I'm looking to find out how to use Python to get rid of needless newlines in text like what you get from Project Gutenberg, where their plain-text files are formatted with newlines every 70 characters or so. In Tcl, I could do a simple string map
, like this:
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
This would keep paragraphs separated by two newlines (or a newline and a tab) separate, but run together the lines that ended with a single newline (substituting a space), and drop superfluous CR's. Since Python doesn't have string map
, I haven't yet been able to find out the most efficient way to dump all the needless newlines, although I'm pretty sure it's not just to search for each newline in order and replace it with a space. I could just evaluate the Tcl expression in Python, if all else fails, but I'd like to find out the best Pythonic way to do the same thing. Can some Python connoisseur here help me out?
The nearest equivalent to the tcl string map
would be str.translate, but unfortunately it can only map single characters. So it would be necessary to use a regexp to get a similarly compact example. This can be done with look-behind/look-ahead assertions, but the \r
's have to be replaced first:
import re
oldtext = """\
This would keep paragraphs separated.
This would keep paragraphs separated.
This would keep paragraphs separated.
\tThis would keep paragraphs separated.
\rWhen, in the course
of human events,
it becomes necessary
\rfor one people
"""
newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
output:
This would keep paragraphs separated. This would keep paragraphs separated.
This would keep paragraphs separated.
This would keep paragraphs separated.
When, in the course of human events, it becomes necessary for one people
I doubt whether this is as efficient as the tcl code, though.
UPDATE:
I did a little test using this Project Gutenberg EBook of War and Peace (Plain Text UTF-8, 3.1 MB). Here's my tcl script:
set fp [open "gutenberg.txt" r]
set oldtext [read $fp]
close $fp
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
puts $newtext
and my python equivalent:
import re
with open('gutenberg.txt') as stream:
oldtext = stream.read()
newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
print(newtext)
Crude performance test:
$ /usr/bin/time -f '%E' tclsh gutenberg.tcl > output1.txt
0:00.18
$ /usr/bin/time -f '%E' python gutenberg.py > output2.txt
0:00.30
So, as expected, the tcl version is more efficient. However, the output from the python version seems somewhat cleaner (no extra spaces inserted at the beginning of lines).
You can use a regular expression with a look-ahead search:
import re
text = """
...
"""
newtext = re.sub(r"\n(?=[^\n\t])", " ", text)
That will replace any new line that is not followed by a newline or a tab with a space.
I use the following script when I want to do this:
import sys
import os
filename, extension = os.path.splitext(sys.argv[1])
with open(filename+extension, encoding='utf-8-sig') as (file
), open(filename+"_unwrapped"+extension, 'w', encoding='utf-8-sig') as (output
):
*lines, last = list(file)
for line in lines:
if line == "\n":
line = "\n\n"
elif line[0] == "\t":
line = "\n" + line[:-1] + " "
else:
line = line[:-1] + " "
output.write(line)
output.write(last)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With