 

Load huge data from BigQuery into python/pandas/dask

I read other similar threads and searched Google to find a better way but couldn't find any workable solution.

I have a large table in BigQuery (assume around 20 million rows inserted per day). I want to pull roughly 20 million rows, with around 50 columns, into python/pandas/dask to do some analysis. I have tried the bqclient, pandas-gbq and BigQuery Storage API approaches, but it takes about 30 minutes to get 5 million rows into Python. Is there any other way to do this, or any Google service that can do a similar job?
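
For reference, something like the following is roughly what that client-library path looks like (a sketch, not the asker's actual code; the project and table names are placeholders, and the `create_bqstorage_client` flag assumes a reasonably recent google-cloud-bigquery plus the google-cloud-bigquery-storage package installed):

    from google.cloud import bigquery

    # placeholder project and table names
    client = bigquery.Client(project="my-project")
    sql = "SELECT * FROM `my-project.dataset.tablename`"

    # to_dataframe() can stream the result over the BigQuery Storage API
    # when google-cloud-bigquery-storage is installed, but every row still
    # ends up in local memory, which is why this gets slow at tens of
    # millions of rows.
    df = client.query(sql).to_dataframe(create_bqstorage_client=True)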

asked Mar 06 '19 by MT467

1 Answer

Instead of querying, you can always export the table to Cloud Storage -> download it locally -> load it into your dask/pandas dataframe:

  1. Export + Download:

    # export the table to sharded CSVs in Cloud Storage, keeping the header
    # row so the column names survive, then pull the shards down locally
    bq --location=US extract \
        --destination_format=CSV \
        --print_header=true \
        'dataset.tablename' \
        'gs://mystoragebucket/data-*.csv' \
    && gsutil -m cp 'gs://mystoragebucket/data-*.csv' /my/local/dir/

  2. Load into Dask (a short usage sketch follows below):

    >>> import dask.dataframe as dd
    >>> df = dd.read_csv("/my/local/dir/*.csv")
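
The read_csv call above is lazy: it only inspects the files, and the heavy I/O happens when you actually compute something. A rough usage sketch of what step 2 gives you (the column name below is hypothetical, since the real schema isn't shown):

    >>> df.npartitions                                # number of partitions (at least one per CSV shard)
    >>> df.head()                                     # reads just the first partition
    >>> df.groupby("some_column").size().compute()    # triggers the full parallel read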
    

Hope it helps.

answered Sep 20 '22 by khan