
Read csv from Google Cloud storage to pandas dataframe

I am trying to read a CSV file from a Google Cloud Storage bucket into a pandas dataframe.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)

It shows this error message:

FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist 

What am I doing wrong? I am not able to find any solution that does not involve Google Datalab.

asked Mar 19 '18 by user1838940



1 Answer

UPDATE

As of pandas version 0.24, read_csv supports reading directly from Google Cloud Storage. Simply provide a gs:// link to the file like this:

df = pd.read_csv('gs://bucket/your_path.csv') 

read_csv will then use the gcsfs module to read the dataframe, which means gcsfs has to be installed (otherwise you will get an exception pointing at the missing dependency).
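If the bucket is private, newer pandas versions (1.2+) can forward credentials to gcsfs through the storage_options argument; a minimal sketch, where the service-account JSON path is just a placeholder:

import pandas as pd

# Public object: no credentials needed (gcsfs must still be installed, e.g. pip install gcsfs)
df = pd.read_csv('gs://bucket/your_path.csv')

# Private object: pass credentials through to gcsfs (pandas >= 1.2).
# The JSON key path below is a placeholder.
df = pd.read_csv('gs://bucket/your_path.csv',
                 storage_options={'token': '/path/to/service_account.json'})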

I leave three other options for the sake of completeness.

  • Home-made code
  • gcsfs
  • dask

I will cover them below.

The hard way: do-it-yourself code

I have written some convenience functions to read from Google Storage. To make them more readable I added type annotations. If you happen to be on Python 2, simply remove the annotations and the code will work all the same.

It works equally well on public and private datasets, assuming you are authorised. With this approach you don't need to download the data to your local drive first.

How to use it:

fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)

The code:

from io import BytesIO
from google.cloud import storage
from google.oauth2 import service_account


def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream


def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    s = blob.download_as_string()
    return s


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
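If you prefer the byte-string variant, decode it and wrap it in a StringIO before handing it to pandas; a minimal sketch built on the functions above:

import pandas as pd
from io import StringIO

# get_bytestring returns raw bytes, so decode before parsing
raw = get_bytestring('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(StringIO(raw.decode('utf-8')))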

gcsfs

gcsfs is a "Pythonic file-system for Google Cloud Storage".

How to use it:

import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)
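For private buckets, gcsfs can also authenticate with a service account key via its token argument; a minimal sketch, with a placeholder path to the JSON key:

import pandas as pd
import gcsfs

# token accepts a path to a service account JSON key (placeholder below),
# or e.g. 'anon' for public data
fs = gcsfs.GCSFileSystem(project='my-project',
                         token='/path/to/service_account.json')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)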

dask

Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It's great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, making it easy to use for newcomers.

Dask provides its own read_csv, which mirrors the pandas interface.

How to use it:

import dask.dataframe as dd

df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv')  # nice!

# df is now a Dask dataframe, ready for distributed processing
# If you want the pandas version, simply:
df_pd = df.compute()
answered Oct 05 '22 by Lukasz Tracewski