I have a really simple script right now that counts lines in a text file using enumerate():

    i = 0
    f = open("C:/Users/guest/Desktop/file.log", "r")
    for i, line in enumerate(f):
        pass
    print i + 1
    f.close()
This takes around three and a half minutes to go through a 15GB log file with ~30 million lines. It would be great if I could get this under two minutes, because these are daily logs and we want to do a monthly analysis, so the code will have to process 30 logs of ~15GB each, possibly more than an hour and a half in total, and we'd like to minimise the time and memory load on the server.
I would also settle for a good approximation/estimation method, but it needs to be about 4 sig fig accurate...
Thank you!
Use readlines() to get Line Count

This is the most straightforward way to count the number of lines in a text file in Python. The readlines() method reads all lines from a file and stores them in a list. Next, use the len() function to find the length of the list, which is the total number of lines in the file. Because the whole file is held in memory at once, this approach suits small files rather than multi-gigabyte logs.
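A minimal sketch of the readlines() approach described above. The function name `count_lines_readlines` and the `encoding`/`errors` arguments are assumptions made for illustration, not part of the original text.

```python
def count_lines_readlines(path):
    """Count lines by loading every line into a list.

    Simple, but the whole file ends up in memory, so this is only
    reasonable for files that comfortably fit in RAM.
    """
    # encoding/errors are illustrative choices; adjust them to your data.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return len(f.readlines())
```

Calling `count_lines_readlines("file.log")` returns the line count as an int; a final line without a trailing newline still counts as one line.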
Reading Large Text Files in Python

We can use the file object as an iterator. The iterator returns the lines one by one, so each can be processed in turn. This does not read the whole file into memory, which makes it suitable for reading large files in Python.
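Iterating over the file object could look roughly like this. The function name `count_lines_iter` and the `encoding`/`errors` arguments are assumptions for the sake of the example.

```python
def count_lines_iter(path):
    """Count lines by iterating over the file object.

    Only one line is held in memory at a time, so the file size
    does not matter much for memory use.
    """
    # encoding/errors are illustrative choices; adjust them to your data.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        # The generator yields 1 per line; sum() adds them up lazily.
        return sum(1 for _ in f)
```

This is essentially the same loop as the enumerate() version in the question, so it should be expected to save memory compared with readlines() rather than to be dramatically faster.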
Use os.walk to traverse the files and subdirectories, use endswith to filter the files you want to count, open each file and use sum(1 for line in f) to count its lines, then add up the per-file counts.
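The steps above can be sketched as follows. The function name `count_lines_in_tree`, the default `.log` suffix, and the `encoding`/`errors` arguments are hypothetical choices for illustration.

```python
import os


def count_lines_in_tree(root, suffix=".log"):
    """Sum the line counts of all files under root whose name ends with suffix."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue  # skip files we are not interested in
            full_path = os.path.join(dirpath, name)
            # encoding/errors are illustrative choices; adjust them to your data.
            with open(full_path, "r", encoding="utf-8", errors="replace") as f:
                total += sum(1 for _ in f)
    return total
```

For the monthly analysis in the question, something like `count_lines_in_tree("logs")` would give the total across all daily logs in a (hypothetical) `logs` directory.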
Ignacio's answer is correct, but might fail if you have a 32-bit process.
But maybe it could be useful to read the file block-wise and then count the \n characters in each block.
    def blocks(files, size=65536):
        # Yield the file's contents in chunks of at most `size` characters.
        while True:
            b = files.read(size)
            if not b:
                break
            yield b

    with open("file", "r") as f:
        print sum(bl.count("\n") for bl in blocks(f))
will do your job.
Note that I don't open the file as binary, so \r\n will be converted to \n, making the counting more reliable.
For Python 3, and to make it more robust, for reading files with all kinds of characters:
    def blocks(files, size=65536):
        # Yield the file's contents in chunks of at most `size` characters.
        while True:
            b = files.read(size)
            if not b:
                break
            yield b

    with open("file", "r", encoding="utf-8", errors="ignore") as f:
        print(sum(bl.count("\n") for bl in blocks(f)))
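For comparison, here is a sketch of the same block-wise idea with the file opened in binary mode, which is not what the answer above does. It skips text decoding entirely, which may or may not help on a given machine; measure before relying on it. Since \r\n contains exactly one \n byte, counting b"\n" gives the same result for Unix and Windows line endings, while files that use a bare \r as the line terminator would be undercounted. The function name `count_newlines_binary` and the 1 MiB default block size are assumptions.

```python
def count_newlines_binary(path, size=1024 * 1024):
    """Count b"\\n" bytes by reading the file in fixed-size binary blocks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(size)
            if not block:  # empty bytes object means end of file
                break
            count += block.count(b"\n")
    return count
```

Note that this counts newline characters, so a final line without a trailing newline is not included in the result.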
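I know it's a bit unfair, but you could do this:

    import subprocess

    # Passing the command as a list avoids relying on shell-style string parsing.
    int(subprocess.check_output(["wc", "-l", "C:\\alarm.bat"]).split()[0])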
If you're on Windows, check out Coreutils.