
(Python) Counting lines in a huge (>10GB) file as fast as possible [duplicate]

I have a really simple script right now that counts lines in a text file using enumerate():

    i = 0
    f = open("C:/Users/guest/Desktop/file.log", "r")
    for i, line in enumerate(f):
        pass
    print i + 1
    f.close()

This takes around three and a half minutes to go through a 15GB log file with ~30 million lines. It would be great if I could get this down to two minutes or less, because these are daily logs and we want to do a monthly analysis, so the code will have to process 30 logs of ~15GB each - possibly more than an hour and a half - and we'd like to minimise the time & memory load on the server.

I would also settle for a good approximation/estimation method, but it would need to be accurate to about four significant figures.
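(As an aside, one way such an approximation could look - this is a sketch of the idea, not code from the thread: sample the start of the file, compute the average bytes per line, and divide the total file size by that average. Its accuracy depends entirely on how uniform the line lengths are, so four significant figures is only realistic for fairly regular logs.)

```python
import os

def estimate_line_count(path, sample_bytes=32 * 1024 * 1024):
    """Estimate line count from average line length in an initial sample.

    Hypothetical helper for illustration; accuracy depends on line
    lengths being roughly uniform across the whole file.
    """
    total = os.path.getsize(path)
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    newlines = sample.count(b"\n")
    if newlines == 0:
        # No line breaks seen in the sample; no basis for an estimate.
        return 1 if total else 0
    avg_line_len = len(sample) / newlines
    return int(total / avg_line_len)
```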

Thank you!

asked Mar 09 '12 by Adrienne




2 Answers

Ignacio's answer is correct, but might fail if you have a 32 bit process.

But maybe it could be useful to read the file block-wise and then count the \n characters in each block.

    def blocks(files, size=65536):
        while True:
            b = files.read(size)
            if not b:
                break
            yield b

    with open("file", "r") as f:
        print sum(bl.count("\n") for bl in blocks(f))

will do your job.

Note that I don't open the file as binary, so \r\n will be converted to \n, making the counting more reliable.
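(A variant of this approach, not from the answer itself: reading in binary mode skips the decoding step, which is typically faster. Counting b"\n" still yields the right total on Windows files, since each "\r\n" ending contributes exactly one "\n".)

```python
def count_lines_binary(path, size=65536):
    """Count lines by reading fixed-size binary blocks.

    Hypothetical variant of the block-wise approach: binary mode avoids
    newline translation and text decoding entirely.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(size)
            if not block:
                break
            count += block.count(b"\n")
    return count
```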

For Python 3, and to make it more robust for reading files with all kinds of characters:

    def blocks(files, size=65536):
        while True:
            b = files.read(size)
            if not b:
                break
            yield b

    with open("file", "r", encoding="utf-8", errors="ignore") as f:
        print(sum(bl.count("\n") for bl in blocks(f)))
answered Sep 29 '22 by glglgl


I know it's a bit unfair, but you could do this:

    import subprocess
    int(subprocess.check_output(["wc", "-l", "C:\\alarm.bat"]).split()[0])

If you're on Windows, check out Coreutils.
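(One way to wrap this up portably - a sketch, not from the answer: use wc -l when the binary is on the PATH, and fall back to a pure-Python block count otherwise.)

```python
import shutil
import subprocess

def count_lines(path):
    """Count lines via wc -l when available, else in pure Python.

    Hypothetical helper combining both answers' approaches; wc -l counts
    newline characters, so the fallback does the same for consistency.
    """
    if shutil.which("wc"):
        return int(subprocess.check_output(["wc", "-l", path]).split()[0])
    with open(path, "rb") as f:
        # iter() with a b"" sentinel yields fixed-size blocks until EOF.
        return sum(block.count(b"\n")
                   for block in iter(lambda: f.read(65536), b""))
```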

answered Sep 29 '22 by Jakob Bowyer