
How to know the byte position of a row of a CSV file in python?

Tags: python, file, csv

I'm working with huge CSV files (20-25 million rows) and, for a number of reasons, don't want to split them into smaller pieces.

My script reads a file row by row using the csv module. I need to know the byte position of the row that will be read on the next iteration (or of the row that was just read).

I tried:

>>> import csv
>>> f = open("uscompany.csv","rU")
>>> reader = csv.reader(f)
>>> reader.next()
....
>>> f.tell()
8230

But it seems the csv module reads the file in blocks, because when I keep iterating I get the same position:

>>> reader.next()
....
>>> f.tell()
8230

Any suggestions? Please advise.

Maksym Polshcha asked Aug 24 '12 12:08



2 Answers

If by "byte position" you mean the byte position as if you had read the file in as a normal text file, then my suggestion is to do just that. Read the file line by line as text and track the position of each line within the file that way. You can still parse the CSV data row by row yourself using the csv module:

import csv
for line in myfile:
  # Parse this single line as one CSV row; next() works in Python 2.6+ and 3.
  row = next(csv.reader([line]))

I think it is perfectly good design for the CSV reader not to provide a byte position of this kind, because it really doesn't make much sense in a CSV context. After all, "data" and data are the exact same four-character value as far as CSV is concerned, but the d might be the 2nd byte or the 1st byte of the field depending on whether the optional surrounding quotes were used.
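To make the line-by-line idea concrete, here is a minimal sketch in Python 3 (the question's session is Python 2). It assumes the file is opened in binary mode, is UTF-8 encoded, and has no quoted fields that span several lines. The name rows_with_offsets is made up for illustration. The offset is summed from the raw line lengths rather than taken from f.tell(), because tell() is not reliable while iterating over a file.

```python
import csv
import io


def rows_with_offsets(binary_file, encoding="utf-8"):
    """Yield (byte_offset, row) for each line of a CSV opened in binary mode.

    Hypothetical helper: assumes one CSV record per physical line.
    """
    offset = 0
    for raw_line in binary_file:
        # Decode only for parsing; the offset is counted in raw bytes.
        text = raw_line.decode(encoding)
        row = next(csv.reader([text]))
        yield offset, row
        offset += len(raw_line)


# Small in-memory demo; a real file would be opened with open(path, "rb").
sample = io.BytesIO(b'a,b\r\n1,2\r\n"x,y",3\n')
for position, row in rows_with_offsets(sample):
    print(position, row)
```

Because the offsets are plain byte counts, a recorded value can later be passed to seek() on a file opened in binary mode to jump straight back to that row.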

John Y answered Nov 07 '22 11:11



Short answer: not possible. The byte position is not available through the csv.reader API.
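The reader itself exposes no position, but csv.reader accepts any iterator of lines, so one possible workaround is to count bytes in that iterator instead. The sketch below is Python 3 and assumes UTF-8 input opened in binary mode. It also assumes, as is the case in CPython, that the reader pulls lines from its input one at a time without reading ahead. OffsetTrackingLines and rows_with_spans are hypothetical names, not part of the csv module.

```python
import csv
import io


class OffsetTrackingLines:
    """Iterator over decoded lines that counts the raw bytes handed out so far."""

    def __init__(self, binary_file, encoding="utf-8"):
        self._file = binary_file
        self._encoding = encoding
        self.offset = 0  # bytes consumed from the file so far

    def __iter__(self):
        return self

    def __next__(self):
        raw = self._file.readline()
        if not raw:
            raise StopIteration
        self.offset += len(raw)
        return raw.decode(self._encoding)


def rows_with_spans(binary_file, encoding="utf-8"):
    """Yield (start_byte, end_byte, row) for each CSV record."""
    lines = OffsetTrackingLines(binary_file, encoding)
    start = 0
    for row in csv.reader(lines):
        # lines.offset now points just past the record that was parsed.
        yield start, lines.offset, row
        start = lines.offset


# In-memory demo with a quoted field that spans two physical lines.
sample = io.BytesIO(b'id,note\n1,"two\nlines"\n2,x\n')
for start, end, row in rows_with_spans(sample):
    print(start, end, row)
```

Unlike parsing each physical line on its own, this keeps a multi-line quoted field together as one record, since the reader decides how many lines each row consumes.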

Andreas Jung answered Nov 07 '22 12:11
