
How to read only the first row of a CSV from Google Cloud Storage?

I have seen this question: How to read first 2 rows of csv from Google Cloud Storage

But in my case, I don't want to load the whole CSV blob into memory, as it could be huge. Is there any way to open it as an iterable (or file-like object) and read only the bytes of the first couple of lines?

Asked by Bunyk, Oct 27 '25 01:10


1 Answer

I wanted to expand on simzes's answer with an example of how to create an iterable for cases where we do not know the size of the CSV header. It could also be useful for reading a CSV from datastore line by line:

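import codecs
import csv
from typing import Generator, List, Optional

from google.cloud import storage


def get_csv_header(blob: storage.blob.Blob) -> Optional[List[str]]:
    # csv.reader pulls lines lazily, so only the chunks needed
    # for the first row are downloaded.
    for line in csv.reader(blob_lines(blob)):
        return line
    return None  # the blob is empty


# How many bytes of the blob to download per request.
# Selected experimentally. If you find a better value, please update it.
BLOB_CHUNK_SIZE = 2000


def blob_lines(blob: storage.blob.Blob) -> Generator[str, None, None]:
    if blob.size is None:
        blob.reload()  # fetch metadata so that blob.size is known
    # An incremental decoder copes with multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder('utf-8')()
    buff = []
    for position in range(0, blob.size, BLOB_CHUNK_SIZE):
        # The byte range is a closed interval, hence the "- 1".
        # download_as_bytes replaces the deprecated download_as_string.
        data = blob.download_as_bytes(start=position, end=position + BLOB_CHUNK_SIZE - 1)
        chunk = decoder.decode(data)
        if '\n' in chunk:
            part1, part2 = chunk.split('\n', 1)
            buff.append(part1)
            yield ''.join(buff)
            parts = part2.split('\n')
            for part in parts[:-1]:
                yield part
            buff = [parts[-1]]
        else:
            buff.append(chunk)

    # Whatever is left after the last newline is the final line.
    last = ''.join(buff)
    if last:
        yield last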
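
To try the chunked-read idea above without touching a real bucket, here is a minimal sketch. The names `iter_lines`, `first_row` and `FakeBlob` are hypothetical and not part of google-cloud-storage; `FakeBlob` only imitates a ranged download with an inclusive `end`, as described above. The tiny `chunk_size=4` is there only to force chunk boundaries to fall inside multi-byte characters.

```python
import codecs
import csv
from typing import Callable, Iterator, List, Optional


def iter_lines(read_range: Callable[[int, int], bytes], size: int,
               chunk_size: int = 2000) -> Iterator[str]:
    """Yield text lines from a byte source that supports ranged reads.

    read_range(start, end) must return bytes start..end inclusive and may
    return fewer bytes when end lies past the end of the data.
    """
    # The incremental decoder keeps incomplete UTF-8 sequences until the next chunk.
    decoder = codecs.getincrementaldecoder("utf-8")()
    pending = ""
    for position in range(0, size, chunk_size):
        pending += decoder.decode(read_range(position, position + chunk_size - 1))
        # Everything before the last newline is complete; the rest stays pending.
        *lines, pending = pending.split("\n")
        yield from lines
    if pending:
        yield pending


class FakeBlob:
    """Hypothetical in-memory stand-in for a blob; counts ranged reads."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.size = len(data)
        self.requests = 0

    def download_as_bytes(self, start: int, end: int) -> bytes:
        self.requests += 1
        return self.data[start:end + 1]  # end is inclusive


def first_row(blob, chunk_size: int = 2000) -> Optional[List[str]]:
    """Return the first CSV row, or None if the blob is empty."""
    lines = iter_lines(lambda s, e: blob.download_as_bytes(start=s, end=e),
                       blob.size, chunk_size)
    return next(csv.reader(lines), None)


fake = FakeBlob("id,név,city\n1,Ágnes,Pécs\n2,Bob,Oslo\n".encode("utf-8"))
header = first_row(fake, chunk_size=4)
print(header, fake.requests)  # ['id', 'név', 'city'] 4
```

Only 4 of the 10 possible chunks are requested here, because `csv.reader` stops pulling lines once the first row is complete. Recent releases of google-cloud-storage also offer `Blob.open`, which returns a file-like object; that is a different approach from the one sketched here.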
Answered by Bunyk, Oct 28 '25 16:10


