How to read only the first row of a CSV from Google Cloud Storage?

Question

I have seen this question: How to read first 2 rows of csv from Google Cloud Storage

But in my case, I don't want to load the whole CSV blob into memory, as it could be huge. Is there a way to open it as an iterable (or file-like object) and read only the bytes of the first couple of lines?

Bunyk · Accepted Answer

I wanted to expand on simzes's answer with an example of how to create an iterable when the size of the CSV header is not known in advance. It could also be useful for reading a CSV from a datastore line by line:

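import csv
from typing import Generator

from google.cloud import storage


def get_csv_header(blob):
    # csv.reader pulls lines lazily, so returning the first row means
    # only the chunk(s) containing that row are downloaded.
    for line in csv.reader(blob_lines(blob)):
        return line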


# How many bytes of the blob to download per request.
# Chosen experimentally. If there is a better value, please update it.
BLOB_CHUNK_SIZE = 2000


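def blob_lines(blob: storage.blob.Blob) -> Generator[str, None, None]:
    position = 0
    buff = b''
    while True:
        # start and end are both inclusive, so this requests exactly BLOB_CHUNK_SIZE bytes.
        # download_as_string returns bytes (newer library versions call it download_as_bytes).
        chunk = blob.download_as_string(start=position, end=position + BLOB_CHUNK_SIZE - 1)
        buff += chunk
        # Split on raw bytes and decode only complete lines, so a multi-byte
        # character cut by a chunk boundary is never decoded in pieces.
        *lines, buff = buff.split(b'\n')
        for line in lines:
            yield line.decode()

        position += BLOB_CHUNK_SIZE
        if len(chunk) < BLOB_CHUNK_SIZE:
            # A short chunk means the end of the blob was reached.
            # Caveat: if the blob size is an exact multiple of BLOB_CHUNK_SIZE, one more
            # request may be made starting past the end; handle that error if it can occur.
            if buff:
                yield buff.decode()
            return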
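
To check the chunk arithmetic without touching a real bucket, the same idea can be tried against an in-memory stand-in. This is only a sketch: FakeBlob, iter_lines and first_row are hypothetical names made up for the illustration, and FakeBlob.download_as_string merely imitates the inclusive start/end byte range described above; it is not part of google-cloud-storage.

```python
import csv
from typing import Iterator, List, Optional

CHUNK_SIZE = 6  # deliberately tiny so that lines (and one character) span chunks


class FakeBlob:
    """Hypothetical in-memory stand-in for a Cloud Storage blob."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.requests = 0  # number of range requests served so far

    def download_as_string(self, start: int, end: int) -> bytes:
        # Imitates an inclusive byte range: the byte at `end` is included.
        self.requests += 1
        return self.data[start:end + 1]


def iter_lines(blob, chunk_size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield decoded lines, fetching chunk_size bytes per request."""
    position = 0
    pending = b""
    while True:
        chunk = blob.download_as_string(start=position, end=position + chunk_size - 1)
        pending += chunk
        # Everything before the last newline is complete; keep the tail for later.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line.decode()
        position += chunk_size
        if len(chunk) < chunk_size:  # short chunk: end of data reached
            if pending:
                yield pending.decode()
            return


def first_row(blob) -> Optional[List[str]]:
    # next() with a default returns None instead of raising on empty input.
    return next(csv.reader(iter_lines(blob)), None)


blob = FakeBlob("id,name,city\n1,Zoë,Kyiv\n2,Bob,Lviv\n".encode())
header = first_row(blob)
print(header, blob.requests)
```

With the 6-byte chunks used here, the header is available after three range requests and the rest of the data is never fetched. The two-byte character in the second row happens to straddle a chunk boundary, which is why the lines are split as bytes and decoded only once complete.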

Tags:

python

google-cloud-platform

google-cloud-storage
