I have seen this question: How to read first 2 rows of csv from Google Cloud Storage
But in my case, I don't want to load the whole CSV blob into memory, as it could be huge. Is there any way to open it as an iterable (or file-like object) and read only the bytes of the first couple of lines?
I wanted to expand on simzes's answer with an example of how to create an iterable when the size of the CSV header is not known in advance. It could also be useful for reading a CSV from datastore line by line:
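import codecs
import csv
from typing import Generator

from google.cloud import storage


def get_csv_header(blob):
    # Return the first CSV row; downloading stops as soon as it is parsed.
    for line in csv.reader(blob_lines(blob)):
        return line


# Number of bytes to download per request.
# Selected experimentally. If there is a better value, please update it.
BLOB_CHUNK_SIZE = 2000


def blob_lines(blob: storage.blob.Blob) -> Generator[str, None, None]:
    # An incremental decoder copes with multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder('utf-8')()
    position = 0
    buff = []
    while True:
        # start/end form a closed interval, so this requests exactly BLOB_CHUNK_SIZE bytes.
        # (Newer google-cloud-storage releases deprecate download_as_string
        # in favor of download_as_bytes.)
        raw = blob.download_as_string(start=position, end=position + BLOB_CHUNK_SIZE - 1)
        position += len(raw)
        chunk = decoder.decode(raw)
        if '\n' in chunk:
            part1, part2 = chunk.split('\n', 1)
            buff.append(part1)
            yield ''.join(buff)
            parts = part2.split('\n')
            for part in parts[:-1]:
                yield part
            buff = [parts[-1]]
        else:
            buff.append(chunk)
        # A short read means the end of the blob. blob.size is only set when the
        # metadata was loaded (e.g. via bucket.get_blob()); checking it avoids
        # one extra request past the end.
        if len(raw) < BLOB_CHUNK_SIZE or position == blob.size:
            last = ''.join(buff)
            if last:  # skip the empty remainder after a trailing newline
                yield last
            return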
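The generator above can be tried out without a GCS bucket. What follows is only a minimal sketch, not part of the original answer: `FakeBlob` is a hypothetical in-memory stand-in that mimics just the closed-interval `start`/`end` behavior of a ranged download, and `iter_lines` / `first_rows` are illustrative names for a generic restatement of the same chunked-read idea, with the chunk size passed as a parameter. It assumes the blob is UTF-8 text and that `blob.size` is known.

```python
import codecs
import csv
from typing import Iterator, List, Optional


class FakeBlob:
    """Hypothetical in-memory stand-in for a GCS blob (local testing only)."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self.size = len(data)
        self.requests = 0  # counts how many ranged downloads were made

    def download_as_string(self, start: int = 0, end: Optional[int] = None) -> bytes:
        # Mimic the closed interval [start, end] of a ranged download.
        self.requests += 1
        stop = len(self._data) if end is None else end + 1
        return self._data[start:stop]


def iter_lines(blob, chunk_size: int = 2000) -> Iterator[str]:
    """Yield decoded lines, downloading chunk_size bytes per request."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    position = 0
    pending = ""  # text after the last newline seen so far
    while position < blob.size:
        raw = blob.download_as_string(start=position, end=position + chunk_size - 1)
        position += len(raw)
        pending += decoder.decode(raw)
        # Everything before the last newline is complete; keep the rest.
        *complete, pending = pending.split("\n")
        yield from complete
    if pending:
        yield pending


def first_rows(blob, count: int, chunk_size: int = 2000) -> List[List[str]]:
    """Return up to `count` parsed CSV rows; no further chunks are requested afterwards."""
    rows = []
    for row in csv.reader(iter_lines(blob, chunk_size)):
        rows.append(row)
        if len(rows) == count:
            break
    return rows


blob = FakeBlob("id,name\n1,Zoë\n2,Bob\n".encode("utf-8") + b"3,x\n" * 1000)
print(first_rows(blob, 2, chunk_size=8))  # [['id', 'name'], ['1', 'Zoë']]
print(blob.requests)                      # 2
```

Only two small requests are made here even though the fake blob holds about 4 KB, because the generator is lazy and the loop stops after the second row. With a real blob, `blob.size` is only populated once the metadata has been loaded, for example via `bucket.get_blob(name)` or `blob.reload()`.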