I have a big CSV file (30 GB+) where I am counting the number of newlines by parsing the file block by block, using the function below:
def yieldblocks(file, size=1024*1024):
    while True:
        block = file.read(size)
        if not block:
            break
        yield block
And calling it like this:
sum(bl.count("\n") for bl in yieldblocks(txtfile))
I am able to count the newlines in slightly under an hour (I am surprised that's the best I could get, too).
My problem is that I need to skip the newlines that occur inside double quotes, as some columns have multiline content.
I tried the following, but it does not seem to work and the process exits without a result:
sum(.5 if re.search('^[^"]*"(?=[^"]*(?:"[^"]*")*[^"]*$).*$', bl) else 1 for bl in yieldblocks(txtfile))
The regular expression is meant to find an odd number of double-quote characters in a line, and it works on a small file.
I am on a machine with 2 GB RAM and a 32-bit OS.
Note: I tried the csv module, but it is slower than counting by blocks, and I was hoping I could somehow get this approach to work.
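For reference, the stateful logic I am trying to express is roughly the sketch below: split each block on the quote character and carry a parity flag across block boundaries, so only newlines outside quoted fields get counted. This is just an illustrative sketch (count_rows is a made-up name) and I have not validated it at the full 30 GB scale:

def count_rows(path, size=1024*1024):
    # Count newlines that fall outside double-quoted fields.
    # Splitting a block on '"' yields alternating outside/inside segments;
    # in_quotes carries the state across block boundaries, and doubled
    # quotes ("") toggle twice, cancelling out as they should.
    newlines = 0
    in_quotes = False
    with open(path, "rb") as f:
        while True:
            block = f.read(size)
            if not block:
                break
            for segment in block.split(b'"'):
                if not in_quotes:
                    newlines += segment.count(b"\n")
                in_quotes = not in_quotes
            in_quotes = not in_quotes  # no quote follows the last segment
    return newlines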
This may work well for you. pandas.read_csv is generally very fast, but I haven't tried it chunk-wise.
import pandas as pd

# Iterate over the file in 10,000-row chunks instead of loading it whole.
reader = pd.read_csv('file.csv', sep=',', chunksize=10000, low_memory=True)
line_count = sum(len(chunk) for chunk in reader)
There is a little more info (not much) in the documentation.
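Note that each chunk is a parsed DataFrame, so len(chunk) counts logical records: newlines embedded inside quoted fields are handled by the parser and will not inflate the total. Memory also stays bounded by the chunk size, which should help on your 2 GB machine.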