Simple regex problem: Removing all new lines from a file

Tags: python, regex

I'm becoming acquainted with python and am creating problems in order to help myself learn the ins and outs of the language. My next problem comes as follows:

I have copied and pasted a huge slew of text from the internet, but the copy and paste added several new lines that break up the huge string. I wish to programmatically remove all of these and turn the string back into one giant blob of characters. This is obviously a job for regex (I think), and parsing through the file and removing all instances of the newline character sounds like it would work, but it doesn't seem to be going over all that well for me.

Is there an easy way to go about this? It seems rather simple.
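
For reference, the regex route asked about here comes down to a single re.sub call. A minimal sketch, with blob.txt standing in as a placeholder for wherever the pasted text was saved:

import re

# 'blob.txt' is just a placeholder name for the pasted text.
with open('blob.txt') as f:
    text = f.read()

# Remove every carriage return / newline character, leaving one long blob.
blob = re.sub(r'[\r\n]+', '', text)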

asked Aug 08 '09 by Chris



2 Answers

The two main alternatives: read everything in as a single string and remove newlines:

clean = open('thefile.txt').read().replace('\n', '')

or, read line by line, removing the newline that ends each line, and join it up again:

clean = ''.join(l[:-1] for l in open('thefile.txt'))
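
One caveat on the second form: l[:-1] assumes every line, including the last, ends with a newline; if the file might lack a trailing newline, the slice clips the final character. A variant using rstrip (a sketch, same placeholder filename) avoids that:

# rstrip('\n') removes a trailing newline if present and is a no-op otherwise.
with open('thefile.txt') as f:
    clean = ''.join(line.rstrip('\n') for line in f)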

The former alternative is probably faster, but, as always, I strongly recommend you MEASURE speed (e.g., use python -mtimeit) in cases of your specific interest, rather than just assuming you know how performance will be. REs are probably slower, but, again: don't guess, MEASURE!

So here are some numbers for a specific text file on my laptop:

$ python -mtimeit -s"import re" "re.sub('\n','',open('AV1611Bible.txt').read())"
10 loops, best of 3: 53.9 msec per loop
$ python -mtimeit "''.join(l[:-1] for l in open('AV1611Bible.txt'))"
10 loops, best of 3: 51.3 msec per loop
$ python -mtimeit "open('AV1611Bible.txt').read().replace('\n', '')"
10 loops, best of 3: 35.1 msec per loop

The file is a version of the KJ Bible, downloaded and unzipped from here (I do think it's important to run such measurements on one easily fetched file, so others can easily reproduce them!).

Of course, a few milliseconds more or less on a file of 4.3 MB, 34,000 lines, may not matter much to you one way or another; but as the fastest approach is also the simplest one (far from an unusual occurrence, especially in Python;-), I think that's a pretty good recommendation.

Alex Martelli answered Oct 15 '22


I wouldn't use a regex for simply replacing newlines - I'd use string.replace(). Here's a complete script:

f = open('input.txt')
contents = f.read()
f.close()
new_contents = contents.replace('\n', '')
f = open('output.txt', 'w')
f.write(new_contents)
f.close()
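
The same script can also be written with with blocks so both files are closed even if something fails midway; a sketch using the same input.txt/output.txt names:

# Same logic, but the files are closed automatically by the context managers.
with open('input.txt') as f:
    contents = f.read()

with open('output.txt', 'w') as f:
    f.write(contents.replace('\n', ''))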

RichieHindle answered Oct 15 '22