
Write-streaming to Google Cloud Storage in Python

I am trying to migrate an AWS Lambda function written in Python to Cloud Functions (CF) that

  1. unzips a file on the fly and reads it line by line
  2. performs some light transformations on each line
  3. writes the output (a line at a time, or in chunks) uncompressed to GCS

The output is > 2 GB but slightly less than 3 GB, so it just barely fits in Lambda.
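For context, the Lambda version of this pipeline looks roughly like the sketch below. Bucket names, keys, and the transformation are hypothetical placeholders; the point is that the gzipped source is decompressed on the fly while the output is buffered in memory.

```python
import gzip
import io

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # The S3 StreamingBody is file-like, so gzip can decompress it on the fly
    body = s3.get_object(Bucket="source-bucket", Key="input.gz")["Body"]
    out = io.BytesIO()  # the whole output just fits in Lambda memory

    with gzip.GzipFile(fileobj=body) as gz:
        for raw in gz:
            line = raw.decode("utf-8").upper()  # stand-in transformation
            out.write(line.encode("utf-8"))

    out.seek(0)
    s3.upload_fileobj(out, "dest-bucket", "output.txt")
```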

Well, it seems impossible, or at least far more involved, in GCP:

  • the uncompressed output cannot fit in memory or in /tmp (limited to 2048 MB as of writing this), so the Python client library's upload_from_file (or upload_from_filename) cannot be used (see the sketch after this list)
  • there is this official paper, but to my surprise it refers to boto, a library originally designed for AWS S3 and quite outdated now that boto3 has been out for some time. There is no genuine GCP method to stream-write or stream-read
  • Node.js has a simple createWriteStream() (nice article here, btw), but there is no equivalent one-liner in Python
  • Resumable media upload sounds like the answer, but it takes a lot of code for something Node handles much more easily
  • App Engine had cloudstorage, but it is not available outside App Engine and is obsolete
  • there are few to no examples out there of a working wrapper for writing text/plain data line by line as if GCS were a local filesystem. This is not limited to Cloud Functions and is a missing feature of the Python client library, but it is more acute in CF due to the resource constraints. Btw, I was part of a discussion to add a writable IOBase implementation, but it gained no traction
  • obviously, using a VM or Dataflow is out of the question for the task at hand.
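For contrast, the client-library call ruled out in the first bullet looks like this. It is a minimal sketch with a hypothetical bucket and object path, and it assumes the complete uncompressed output already sits in /tmp, which the 2048 MB limit makes impossible here.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")        # hypothetical bucket
blob = bucket.blob("output/result.txt")    # hypothetical object path

# upload_from_filename needs the complete file on local disk, so the whole
# uncompressed output would have to fit in /tmp before the upload starts;
# upload_from_file has the same constraint, just with a file object.
blob.upload_from_filename("/tmp/result.txt")
```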

In my mind, stream (or stream-like) reading/writing from cloud-based storage should even be included in the Python standard library.

As recommended back then, one can still use GCSFS, which behind the scenes commits the upload in chunks for you while you write to a file object. The same team wrote s3fs. I don't know about Azure.
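A minimal sketch of the GCSFS approach, assuming the gcsfs package is installed, default credentials are available, and a hypothetical bucket named my-bucket:

```python
import gcsfs

fs = gcsfs.GCSFileSystem(project="my-project")   # hypothetical project id

# gcsfs buffers what you write and commits it to GCS in chunks behind the
# scenes, so memory use stays bounded regardless of the output size.
with fs.open("my-bucket/output/result.txt", "wb") as f:
    for i in range(1_000_000):
        f.write(f"transformed line {i}\n".encode("utf-8"))
```

Each write lands in an internal buffer; once the buffer fills, gcsfs pushes it as another chunk of the upload, which is what makes the line-by-line pattern above possible without holding the whole output.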

AFAIC, I will stick to AWS Lambda for now, as the output can fit in memory, but multipart upload is the way to go to support any output size with a minimum of memory.
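A minimal sketch of what that would look like on the AWS side with boto3's multipart upload; the bucket, key, and line generator are hypothetical, and every part except the last must be at least 5 MB:

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "output.txt"    # hypothetical bucket/key
PART_SIZE = 5 * 1024 * 1024                # S3 minimum part size

def generate_transformed_lines():
    # stand-in for the real unzip-and-transform step
    for i in range(1_000_000):
        yield f"transformed line {i}\n"

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts, buffer, part_number = [], bytearray(), 1

def flush(data, number):
    resp = s3.upload_part(
        Bucket=bucket, Key=key, PartNumber=number,
        UploadId=mpu["UploadId"], Body=bytes(data),
    )
    return {"ETag": resp["ETag"], "PartNumber": number}

for line in generate_transformed_lines():
    buffer.extend(line.encode("utf-8"))
    if len(buffer) >= PART_SIZE:           # only the buffer stays in memory
        parts.append(flush(buffer, part_number))
        part_number += 1
        buffer = bytearray()

if buffer:                                  # final, possibly short, part
    parts.append(flush(buffer, part_number))

s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)
```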

Thoughts or alternatives?

asked Oct 30 '18 by Yannick Einsweiler


1 Answer

I got confused between multipart and resumable upload. The latter is what you need for "streaming" - in practice it is more like uploading chunks of a buffered stream.

Multipart upload, on the other hand, is for loading the data and its custom metadata at once, in the same API call.

While I like GCSFS very much (Martin, its main contributor, is very responsive), I recently found an alternative that uses the google-resumable-media library.

GCSFS is built upon the core HTTP API, whereas Seth's solution uses a low-level library maintained by Google that is more in sync with API changes and includes exponential backoff. The latter is really a must for large/long streams, as the connection may drop, even within GCP; we faced that issue with GCF.
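To give a flavour of it, here is a minimal sketch of a chunked upload with google-resumable-media (the library behind Seth's solution). The bucket and object names are hypothetical, chunks must be multiples of 256 KiB, and a real streaming writer would wrap this in a file-like class that feeds the stream incrementally instead of holding everything in a BytesIO.

```python
import io

import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.resumable_media.requests import ResumableUpload

CHUNK_SIZE = 4 * 256 * 1024          # 1 MiB, a multiple of the 256 KiB quantum
BUCKET = "my-bucket"                 # hypothetical
BLOB_NAME = "output/result.txt"      # hypothetical

credentials, _ = google.auth.default()
transport = AuthorizedSession(credentials)

url = (
    "https://www.googleapis.com/upload/storage/v1/"
    f"b/{BUCKET}/o?uploadType=resumable"
)
upload = ResumableUpload(upload_url=url, chunk_size=CHUNK_SIZE)

# For the sketch, the payload sits in a single BytesIO; a streaming wrapper
# would expose write() and push data into the stream as it arrives.
stream = io.BytesIO(b"".join(f"line {i}\n".encode() for i in range(100_000)))

upload.initiate(
    transport, stream,
    metadata={"name": BLOB_NAME},
    content_type="text/plain",
)

# The library retries transient failures, which is where the exponential
# backoff mentioned above comes in.
while not upload.finished:
    upload.transmit_next_chunk(transport)
```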

On a closing note, I still believe the Google Cloud client library is the right place to add stream-like functionality, with basic write and read methods. It already has the core code.

If you too are interested in having that feature in the core library, give the issue here a thumbs-up - assuming priority is based on votes.

answered Sep 22 '22 by Yannick Einsweiler