I'm working with huge CSV files (20-25 million rows) and don't want to split them into smaller pieces, for a lot of reasons.
My script reads a file row by row using the csv module. I need to know the position (byte offset) of the line that will be read on the next iteration (or of the line that was just read).
I tried:
>>> import csv
>>> f = open("uscompany.csv","rU")
>>> reader = csv.reader(f)
>>> reader.next()
....
>>> f.tell()
8230
But it seems the csv module reads the file in blocks, because when I keep iterating I get the same position:
>>> reader.next()
....
>>> f.tell()
8230
Any suggestions? Please advise.
If by "byte position" you mean the byte position as if you had read the file in as a normal text file, then my suggestion is to do just that. Read the file line by line as text, and get the position that way. You can still parse the CSV data row by row yourself using the csv module:
for line in myfile:
    # Parse this single line as one CSV row.
    row = next(csv.reader([line]))
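The line-by-line idea above can be extended to keep a running byte offset. What follows is only a minimal sketch, written for Python 3 rather than the Python 2 session in the question. The helper name rows_with_offsets, the utf-8 encoding, and the in-memory sample data are assumptions made for illustration. It also assumes that no quoted field contains an embedded newline, since such a row would span several physical lines and would not parse correctly one line at a time.

```python
import csv
import io


def rows_with_offsets(binary_file, encoding="utf-8"):
    """Yield (byte_offset, row) for each physical line of a binary file object.

    Hypothetical helper: assumes every CSV row fits on one physical line.
    """
    offset = 0
    for raw_line in binary_file:
        # Decode and parse just this one line as a single CSV row.
        row = next(csv.reader([raw_line.decode(encoding)]))
        yield offset, row
        # Count raw bytes, so the line terminator is included in the offset.
        offset += len(raw_line)


# Made-up sample data standing in for a real file opened with open(path, "rb").
sample = io.BytesIO(b'name,city\r\nAcme,Boston\r\n"Smith, Inc",Denver\r\n')
result = list(rows_with_offsets(sample))
```

Because the file is read in binary mode, each offset is a true byte position and could later be passed to seek() to jump straight back to that row.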
I think it is perfectly good design for the CSV reader not to provide a byte position of this kind, because it really doesn't make much sense in a CSV context. After all, "data" and data are exactly the same four bytes of data as far as CSV is concerned, but the d might be the 2nd byte or the 1st byte depending on whether the optional surrounding quotes were used.
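As a quick check of the point about optional quotes, here is a small sketch (Python 3, with made-up one-line inputs) showing that the csv module returns the same parsed row whether or not the field is quoted, even though the raw lines differ in length:

```python
import csv

quoted_line = '"data",1'
bare_line = 'data,1'

# Both lines parse to the same fields; the surrounding quotes are not data.
quoted_row = next(csv.reader([quoted_line]))
bare_row = next(csv.reader([bare_line]))

# The raw lines differ by the two quote characters.
length_difference = len(quoted_line) - len(bare_line)
```

So a byte position inside the raw line says nothing reliable about where a value sits in the parsed row.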
Short answer: not possible. The byte position is not available through the csv.reader API.