
Obtain csv-like parse AND line length byte count?

Tags:

python

csv

I'm familiar with Python's csv module, and believe it's necessary in my case, as I have some fields that contain the delimiter (a | rather than a comma, but that's irrelevant) within quotes.

However, I am also looking for the byte-count length of each original row, prior to splitting into columns. I can't count on the data to always quote a column, and I don't know if/when csv will strip off outer quotes, so I don't think (but might be wrong) that simply joining on my delimiter will reproduce the original line string (less CRLF characters). Meaning, I'm not positive the following works:

with open(fname) as fh:
    reader = csv.reader(fh, delimiter="|")
    for row in reader:
        original = "|".join(row) ## maybe?

I've tried looking at csv to see if there was anything in there that I could use/monkey-patch for this purpose, but since _csv.reader is a .so, I don't know how to mess around with that.

In case I'm dealing with an XY problem, my ultimate goal is to read through a CSV file, extracting certain fields and their overall file offsets to create a sort of look-up index. That way, later, when I have a list of candidate values, I can check each one's file-offset and seek() there, instead of chugging through the whole file again. As an idea of scale, I might have 100k values to look up across a 10GB file, so re-reading the file 100k times doesn't feel efficient to me. I'm open to other suggestions than the CSV module, but will still need csv-like intelligent parsing behavior.

EDIT: Not sure how to make it clearer than the title and body already do - simply seek()-ing on a file handle isn't sufficient, because I also need to parse the lines as CSV in order to pull out additional information.
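For what it's worth, a quick experiment (with a made-up sample line) suggests the worry about joining is justified - csv does strip the outer quotes, so the round trip loses bytes:

```python
import csv

line = 'a|"b|c"|d'
(row,) = csv.reader([line], delimiter="|")
print(row)            # ['a', 'b|c', 'd']
print("|".join(row))  # a|b|c|d  -- 7 chars, but the original line was 9
```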

asked Jan 20 '26 by dwanderson

2 Answers

You can't subclass _csv.reader, but the csvfile argument to the csv.reader() constructor only has to be a "file-like object". This means you could supply an instance of your own class that does some preprocessing, such as remembering the length and file offset of the last line read. Here's an implementation showing exactly that. Note that the line length does not include the end-of-line character(s). It also shows how the offsets to each line/row could be stored and used after the file is read.

import csv

class CSVInputFile(object):
    """ File-like object that records the offset and length of the
        last line read, so they can be inspected after csv.reader
        has consumed the line. """
    def __init__(self, file):
        self.file = file
        self.offset = None
        self.linelen = None
    def __iter__(self):
        return self
    def __next__(self):
        # Record where the line starts *before* reading it, so the
        # offset is exactly what file.seek() will need later.
        offset = self.file.tell()
        data = self.file.readline()
        if not data:
            raise StopIteration
        self.offset = offset
        self.linelen = len(data)
        return data
    next = __next__  # Python 2 compatibility

offsets = []  # remember where each row starts
fname = 'unparsed.csv'
with open(fname) as fh:
    csvfile = CSVInputFile(fh)
    for row in csv.reader(csvfile, delimiter="|"):
        print('offset: {}, linelen: {}, row: {}'.format(
            csvfile.offset, csvfile.linelen, row))  # file offset and length of row
        offsets.append(csvfile.offset)  # remember where each row started
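To round out the second half of the workflow - seeking back to a stored offset and re-parsing just that row - here's a minimal sketch. It uses io.StringIO and made-up data as a stand-in for the real 10GB file; with a real file opened in text mode, tell() returns an opaque cookie, but seek() accepts it the same way:

```python
import csv
import io

# Hypothetical sample data standing in for the real file.
data = 'a|"b|c"|d\n1|2|3\nx|y|z\n'
fh = io.StringIO(data)

# First pass: record the offset where each line starts.
offsets = []
while True:
    pos = fh.tell()
    if not fh.readline():
        break
    offsets.append(pos)

# Later: jump straight to the second row and parse only that line.
fh.seek(offsets[1])
row = next(csv.reader([fh.readline()], delimiter="|"))
print(row)  # ['1', '2', '3']
```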
answered Jan 21 '26 by martineau


Depending on your performance requirements and the size of the data, the low-tech solution is to simply read the file twice: make a first pass where you record the length of each line, then run the data through the csv parser. On my somewhat outdated Mac I can read and count the length of 2-3 million lines in a second, which isn't a huge performance hit.
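A minimal sketch of that two-pass approach, using io.StringIO and made-up data in place of the real file:

```python
import csv
import io

# Stand-in for the real file; assumes | as the delimiter, as in the question.
data = 'a|"b|c"\n1|2\n'
fh = io.StringIO(data)

# Pass 1: length of each raw line (including the newline).
line_lengths = [len(line) for line in fh]

# Pass 2: rewind and let csv handle the quote-aware parsing.
fh.seek(0)
rows = list(csv.reader(fh, delimiter="|"))

print(line_lengths)  # [8, 4]
print(rows)          # [['a', 'b|c'], ['1', '2']]
```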

answered Jan 22 '26 by Bryan Oakley


