
Process very large (>20GB) text file line by line

Tags: python, line

I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in seven fields and I want to remove the last three characters from each of the first three fields - which should reduce the file size by about 20%.

I am brand new to Python and have code which does what I want at about 3.4 GB per hour, but to be a worthwhile exercise it really needs to run at 10 GB/hr or better - is there any way to speed this up? This code doesn't come close to challenging my processor, so my uneducated guess is that it is limited by the read and write speed of the internal hard drive.

def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:
        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
        w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
        l = r.readline()
    r.close()
    w.close()

Any help would be really appreciated. I am using the IDLE Python GUI on Windows 7 and have 16GB of memory - perhaps a different OS would be more efficient?

Edit: Here is an extract of the file to be processed.

70700.642014 31207.277115 -0.054123 -1585 255 255 255
70512.301468 31227.990799 -0.255600 -1655 155 158 158
70515.727097 31223.828659 -0.066727 -1734 191 187 180
70566.756699 31217.065598 -0.205673 -1727 254 255 255
70566.695938 31218.030807 -0.047928 -1689 249 251 249
70536.117874 31227.837662 -0.033096 -1548 251 252 252
70536.773270 31212.970322 -0.115891 -1434 155 158 163
70533.530777 31215.270828 -0.154770 -1550 148 152 156
70533.555923 31215.341599 -0.138809 -1480 150 154 158
Tom_b asked May 21 '13 11:05

People also ask

How do you process a large text file in Python?

We can use the file object as an iterator. The iterator returns each line one by one, and each line can be processed as it is read. This does not read the whole file into memory, which makes it suitable for reading large files in Python.
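For example, a minimal sketch of this pattern (the filename "big.txt" and the line-counting body are just placeholders for illustration):

import sys

# Iterate over the file object: only one line is held in memory at a time.
line_count = 0
with open("big.txt", "r") as f:
    for line in f:
        line_count += 1   # stand-in for whatever per-line processing you need
print(line_count, "lines", file=sys.stdout)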

How do I read a 10 GB file in Python?

Files can also be read with the readlines() method, which returns a list where each item is a complete line of the file. Because the entire list is held in memory, this is only practical when the file is small; for a 10 GB file, iterate over the file object instead, as described above.
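For a small file, a brief sketch of the readlines() approach (again, "small.txt" is just a placeholder name):

# readlines() loads the whole file into a list, one string per line.
with open("small.txt", "r") as f:
    lines = f.readlines()
print(len(lines), "lines loaded into memory")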

How do I open a large text file in Excel?

The steps to import a TXT or CSV file into Excel are similar for Excel 2007, 2010, 2013, and 2016: Open the Excel spreadsheet where you want to save the data and click the Data tab. In the Get External Data group, click From Text. Select the TXT or CSV file you want to convert and click Import.


2 Answers

It's more idiomatic to write your code like this:

def ProcessLargeTextFile():
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))

The main saving here is doing the split only once, but if the CPU is not being taxed, this is likely to make very little difference.

It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of your hard drive. A million lines is only 54MB of RAM!

def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            bunch.append(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)

As suggested by @Janne, here is an alternative way to generate the lines. It rebuilds each line with join instead of replace, so a field value that happens to occur elsewhere in the line cannot be changed by accident:

def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)
John La Rooy answered Sep 27 '22 18:09


Measure! You have already received some useful hints on how to improve your Python code, and I agree with them. But you should first figure out what your real problem is. My first steps to find the bottleneck would be:

  • Remove any processing from your code. Just read and write the data and measure the speed (a minimal pass-through sketch is shown after this list). If just reading and writing the files is already too slow, the bottleneck is not your code.
  • If plain reading and writing is already slow, try using multiple disks. You are reading and writing at the same time - on the same disk? If so, put the input and output on different disks and try again.
  • An asynchronous I/O library (Twisted?) might help too.
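Here is a minimal sketch of such a pass-through measurement, assuming placeholder input and output paths; it copies the data in binary chunks with no processing, so the number it prints reflects raw disk throughput:

import time

def measure_copy_speed(inpath, outpath, chunksize=1024 * 1024):
    # Copy the file in fixed-size chunks with no per-line processing.
    start = time.time()
    nbytes = 0
    with open(inpath, "rb") as r, open(outpath, "wb") as w:
        for chunk in iter(lambda: r.read(chunksize), b""):
            w.write(chunk)
            nbytes += len(chunk)
    elapsed = time.time() - start
    print("%.0f MB copied at %.1f MB/s" % (nbytes / 1e6, nbytes / 1e6 / elapsed))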

Once you have figured out the exact problem, ask again about optimizing that specific part.

Achim answered Sep 27 '22 18:09