Read csv from Google Cloud storage to pandas dataframe

I am trying to read a CSV file stored in a Google Cloud Storage bucket into a pandas DataFrame.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline
    from io import BytesIO

    from google.cloud import storage

    storage_client = storage.Client()
    bucket = storage_client.get_bucket('createbucket123')
    blob = bucket.blob('my.csv')
    path = "gs://createbucket123/my.csv"
    df = pd.read_csv(path)

It shows this error message:

FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist

What am I doing wrong? I have not been able to find any solution that does not involve Google Datalab.

asked Mar 19 '18 06:03

user1838940

1 Answer

UPDATE

As of pandas version 0.24, read_csv supports reading directly from Google Cloud Storage. Simply provide the gs:// path to the file in the bucket, like this:

    df = pd.read_csv('gs://bucket/your_path.csv')

read_csv then uses the gcsfs module to read the file, which means gcsfs has to be installed (otherwise you will get an exception pointing at the missing dependency).
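Since the gs:// support relies on an optional package, it can be worth checking for it before calling read_csv. The following is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical and not part of pandas or gcsfs.

```python
from importlib.util import find_spec


def missing_modules(*names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


# pandas >= 0.24 hands gs:// paths over to gcsfs, so both must be importable.
missing = missing_modules("pandas", "gcsfs")
if missing:
    print("Install first: pip install " + " ".join(missing))
else:
    print("pandas and gcsfs are importable; pd.read_csv('gs://...') can be tried")
```

This only confirms that the packages can be found; it says nothing about credentials or bucket permissions.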

I am leaving the three other options here for the sake of completeness:

Home-made code
gcsfs
dask

I will cover them below.

The hard way: do-it-yourself code

I have written some convenience functions to read from Google Storage. To make them more readable I added type annotations. If you happen to be on Python 2, simply remove these and the code will work all the same.

It works equally well on public and private data sets, assuming you are authorised. With this approach you don't need to download the data to your local drive first.

How to use it:

    fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
    df = pd.read_csv(fileobj)

The code:

    from io import BytesIO

    from google.cloud import storage
    from google.oauth2 import service_account


    def get_byte_fileobj(project: str,
                         bucket: str,
                         path: str,
                         service_account_credentials_path: str = None) -> BytesIO:
        """
        Retrieve data from a given blob on Google Storage and pass it as a file object.
        :param project: name of the project
        :param bucket: name of the bucket
        :param path: path within the bucket
        :param service_account_credentials_path: path to credentials.
               TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
        :return: file object (BytesIO)
        """
        blob = _get_blob(bucket, path, project, service_account_credentials_path)
        byte_stream = BytesIO()
        blob.download_to_file(byte_stream)
        # Rewind so that the caller reads from the beginning of the data.
        byte_stream.seek(0)
        return byte_stream


    def get_bytestring(project: str,
                       bucket: str,
                       path: str,
                       service_account_credentials_path: str = None) -> bytes:
        """
        Retrieve data from a given blob on Google Storage and pass it as a byte-string.
        :param project: name of the project
        :param bucket: name of the bucket
        :param path: path within the bucket
        :param service_account_credentials_path: path to credentials.
               TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
        :return: byte-string (needs to be decoded)
        """
        blob = _get_blob(bucket, path, project, service_account_credentials_path)
        # Note: newer google-cloud-storage releases deprecate download_as_string()
        # in favour of download_as_bytes().
        s = blob.download_as_string()
        return s


    def _get_blob(bucket_name, path, project, service_account_credentials_path):
        # Fall back to the default credentials when no key file is given.
        credentials = service_account.Credentials.from_service_account_file(
            service_account_credentials_path) if service_account_credentials_path else None
        storage_client = storage.Client(project=project, credentials=credentials)
        bucket = storage_client.get_bucket(bucket_name)
        blob = bucket.blob(path)
        return blob
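The byte_stream.seek(0) call in get_byte_fileobj matters: right after the download the stream position is at the end, so a reader would see no data. The sketch below reproduces this without touching Google Cloud. FakeBlob is a hypothetical in-memory stand-in (not part of google-cloud-storage), and the standard csv module is used in place of pd.read_csv. It also shows the decode step that the byte-string returned by get_bytestring needs.

```python
import csv
from io import BytesIO, StringIO, TextIOWrapper


class FakeBlob:
    """Hypothetical stand-in for a storage blob; keeps its bytes in memory."""

    def __init__(self, payload: bytes):
        self._payload = payload

    def download_to_file(self, file_obj):
        # Same visible effect as a real download: bytes are written to file_obj.
        file_obj.write(self._payload)

    def download_as_string(self) -> bytes:
        return self._payload


blob = FakeBlob(b"name,score\nada,1\ngrace,2\n")

# File-object route: download, then rewind before reading.
byte_stream = BytesIO()
blob.download_to_file(byte_stream)
unrewound = byte_stream.read()  # position is at the end, so nothing is read
byte_stream.seek(0)
text_stream = TextIOWrapper(byte_stream, encoding="utf-8", newline="")
rows_from_fileobj = list(csv.reader(text_stream))

# Byte-string route: decode first, then wrap the text in a buffer.
text = blob.download_as_string().decode("utf-8")
rows_from_bytes = list(csv.reader(StringIO(text)))

print(unrewound)
print(rows_from_fileobj)
print(rows_from_bytes == rows_from_fileobj)
```

UTF-8 is an assumption here; a file in another encoding would need the matching codec.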

gcsfs

gcsfs is a "Pythonic file-system for Google Cloud Storage".

How to use it:

    import pandas as pd
    import gcsfs

    fs = gcsfs.GCSFileSystem(project='my-project')
    with fs.open('bucket/path.csv') as f:
        df = pd.read_csv(f)
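The examples above pass the bucket and the object path without the gs:// scheme, whereas the question starts from a full gs:// URI. A small helper can bridge the two. This is only a sketch: split_gs_uri is a hypothetical name, not a function from gcsfs or google-cloud-storage.

```python
def split_gs_uri(uri: str):
    """Split 'gs://bucket/path/to/file.csv' into (bucket, object_path)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # Everything up to the first slash is the bucket, the rest is the object path.
    bucket, _, path = uri[len(prefix):].partition("/")
    if not bucket or not path:
        raise ValueError(f"expected gs://<bucket>/<path>, got: {uri!r}")
    return bucket, path


bucket, path = split_gs_uri("gs://createbucket123/my.csv")
print(bucket)             # bucket name
print(path)               # object path within the bucket
print(f"{bucket}/{path}")  # the 'bucket/path.csv' form used with fs.open
```

The pair can then be passed to get_byte_fileobj(project, bucket, path), or joined as shown for fs.open.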

dask

Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It's great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, making it easy to use for newcomers.

Dask provides its own read_csv, which accepts gs:// paths, including glob patterns.

How to use it:

    import dask.dataframe as dd

    df = dd.read_csv('gs://bucket/data.csv')
    df2 = dd.read_csv('gs://bucket/path/*.csv')  # nice!

    # df is now a Dask dataframe, ready for distributed processing.
    # If you want to have the pandas version, simply:
    df_pd = df.compute()

answered Oct 05 '22 11:10

Lukasz Tracewski

Tags: python, pandas, csv, google-cloud-platform, google-cloud-storage
