
(Python) Counting lines in a huge (>10GB) file as fast as possible [duplicate]

I have a really simple script right now that counts lines in a text file using enumerate():

    i = 0
    f = open("C:/Users/guest/Desktop/file.log", "r")
    for i, line in enumerate(f):
        pass
    print i + 1
    f.close()

This takes around three and a half minutes to go through a 15GB log file with ~30 million lines. It would be great if I could get this down to two minutes or less, because these are daily logs and we want to do a monthly analysis, so the code will have to process 30 logs of ~15GB each - possibly more than an hour and a half - and we'd like to minimise the time & memory load on the server.

I would also settle for a good approximation/estimation method, but it would need to be accurate to about four significant figures.
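(As an aside, one way such an approximation could look - this is a sketch of the idea, not code from the thread: sample the start of the file, compute the average bytes per line, and divide the total file size by that average. Its accuracy depends entirely on how uniform the line lengths are, so four significant figures is only realistic for fairly regular logs.)

```python
import os

def estimate_line_count(path, sample_bytes=32 * 1024 * 1024):
    """Estimate line count from average line length in an initial sample.

    Hypothetical helper for illustration; accuracy depends on line
    lengths being roughly uniform across the whole file.
    """
    total = os.path.getsize(path)
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    newlines = sample.count(b"\n")
    if newlines == 0:
        # No line breaks seen in the sample; no basis for an estimate.
        return 1 if total else 0
    avg_line_len = len(sample) / newlines
    return int(total / avg_line_len)
```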

Thank you!

asked Mar 09 '12 by Adrienne




2 Answers

Ignacio's answer is correct, but might fail if you have a 32 bit process.

But maybe it could be useful to read the file block-wise and then count the \n characters in each block.

    def blocks(files, size=65536):
        while True:
            b = files.read(size)
            if not b:
                break
            yield b

    with open("file", "r") as f:
        print sum(bl.count("\n") for bl in blocks(f))

will do your job.

Note that I don't open the file as binary, so \r\n will be converted to \n, making the counting more reliable.
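(A variant of this approach, not from the answer itself: reading in binary mode skips the decoding step, which is typically faster. Counting b"\n" still yields the right total on Windows files, since each "\r\n" ending contributes exactly one "\n".)

```python
def count_lines_binary(path, size=65536):
    """Count lines by reading fixed-size binary blocks.

    Hypothetical variant of the block-wise approach: binary mode avoids
    newline translation and text decoding entirely.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(size)
            if not block:
                break
            count += block.count(b"\n")
    return count
```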

For Python 3, and to make it more robust for reading files with all kinds of characters:

    def blocks(files, size=65536):
        while True:
            b = files.read(size)
            if not b:
                break
            yield b

    with open("file", "r", encoding="utf-8", errors="ignore") as f:
        print(sum(bl.count("\n") for bl in blocks(f)))
answered Sep 29 '22 by glglgl


I know it's a bit unfair, but you could do this:

    import subprocess
    int(subprocess.check_output(["wc", "-l", "C:\\alarm.bat"]).split()[0])

If you're on Windows, check out Coreutils.
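(One way to wrap this up portably - a sketch, not from the answer: use wc -l when the binary is on the PATH, and fall back to a pure-Python block count otherwise.)

```python
import shutil
import subprocess

def count_lines(path):
    """Count lines via wc -l when available, else in pure Python.

    Hypothetical helper combining both answers' approaches; wc -l counts
    newline characters, so the fallback does the same for consistency.
    """
    if shutil.which("wc"):
        return int(subprocess.check_output(["wc", "-l", path]).split()[0])
    with open(path, "rb") as f:
        # iter() with a b"" sentinel yields fixed-size blocks until EOF.
        return sum(block.count(b"\n")
                   for block in iter(lambda: f.read(65536), b""))
```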

answered Sep 29 '22 by Jakob Bowyer