
Accessing BigQuery from Cloud Datalab using Pandas

I have a Jupyter notebook accessing BigQuery using Pandas as the vehicle:

df = pd.io.gbq.read_gbq(query, project_id='xxxxxxx-xxxx')

This works fine from my local machine (great, in fact!), but when I load the same notebook into Cloud Datalab I get:

DistributionNotFound: google-api-python-client

This is rather disappointing. I believed the module would be installed alongside Pandas, but somehow Google is not including it. For a number of reasons it would be preferable not to have to change the code between what we develop on our local machines and what runs in Cloud Datalab; in our case we heavily parameterize the data access, roughly as sketched below.
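To illustrate the kind of parameterization I mean (the table name, row cap, and query here are hypothetical placeholders, not our actual code):

import pandas as pd

# hypothetical parameters substituted into a legacy-SQL query string
table = 'mydataset.mytable'
limit = 1000
query = 'SELECT * FROM [{0}] LIMIT {1}'.format(table, limit)
df = pd.io.gbq.read_gbq(query, project_id='xxxxxxx-xxxx')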

OK, I ran:

!pip install --upgrade google-api-python-client

Now when I run the notebook I get an auth prompt that I cannot complete, since Datalab is running on a remote machine:

Your browser has been opened to visit:
 >>> Browser string>>>>
If your browser is on a different machine then exit and re-run this
application with the command-line parameter 

 --noauth_local_webserver

I don't see an obvious answer to this.
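For reference, the flag the prompt mentions only helps when you control the OAuth flow yourself; here is a minimal sketch of how it would be passed through oauth2client (assuming tools.run_flow were driving the authentication, which pandas' built-in flow does not expose):

import argparse
from oauth2client import tools

# oauth2client's stock argparser already understands --noauth_local_webserver
parser = argparse.ArgumentParser(parents=[tools.argparser])
flags = parser.parse_args(['--noauth_local_webserver'])
# tools.run_flow(flow, storage, flags) would then print a URL and ask for
# a pasted verification code instead of trying to open a local browser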

I used the code suggested below by @Anthonios Partheniou from within the same notebook (executing it in a cell) after updating google-api-python-client in the notebook, and I got the following traceback:

TypeError                                 Traceback (most recent call last)
<ipython-input-3-038366843e56> in <module>()
  5                            scope='https://www.googleapis.com/auth/bigquery',
  6                            redirect_uri='urn:ietf:wg:oauth:2.0:oob')
----> 7 storage = Storage('bigquery_credentials.dat')
  8 authorize_url = flow.step1_get_authorize_url()
  9 print 'Go to the following link in your browser: ' + authorize_url

/usr/local/lib/python2.7/dist-packages/oauth2client/file.pyc in __init__(self, filename)
 37 
 38     def __init__(self, filename):
---> 39         super(Storage, self).__init__(lock=threading.Lock())
 40         self._filename = filename
 41 

TypeError: object.__init__() takes no parameters

He mentions the need to execute the notebook from the same folder as the credentials file, yet the only way I know of to execute a Datalab notebook is via the repo. (A quick working-directory check is sketched below.)
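A minimal check, assuming the credentials file simply has to sit in the notebook's current working directory:

import os

# 'bigquery_credentials.dat' would need to live in this folder
print(os.getcwd())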

While the new Jupyter Datalab module is a possible alternative, the ability to use the full Pandas BQ interface unchanged on both local and Datalab instances would be hugely helpful! So I'm crossing my fingers for a solution.

pip installed:
GCPDataLab 0.1.0
GCPData 0.1.0
wheel 0.29.0
tensorflow 0.6.0
protobuf 3.0.0a3
oauth2client 1.4.12
futures 3.0.3
pexpect 4.0.1
terminado 0.6
pyasn1 0.1.9
jsonschema 2.5.1
mistune 0.7.2
statsmodels 0.6.1
path.py 8.1.2
ipython 4.1.2
nose 1.3.7
MarkupSafe 0.23
py-dateutil 2.2
pyparsing 2.1.1
pickleshare 0.6
pandas 0.18.0
singledispatch 3.4.0.3
PyYAML 3.11
nbformat 4.0.1
certifi 2016.2.28
notebook 4.0.2
cycler 0.10.0
scipy 0.17.0
ipython-genutils 0.1.0
pyasn1-modules 0.0.8
functools32 3.2.3-2
ipykernel 4.3.1
pandocfilters 1.2.4
decorator 4.0.9
jupyter-core 4.1.0
rsa 3.4.2
mock 1.3.0
httplib2 0.9.2
pytz 2016.3
sympy 0.7.6
numpy 1.11.0
seaborn 0.6.0
pbr 1.8.1
backports.ssl-match-hostname 3.5.0.1
ggplot 0.6.5
simplegeneric 0.8.1
ptyprocess 0.5.1
funcsigs 0.4
scikit-learn 0.16.1
traitlets 4.2.1
jupyter-client 4.2.2
nbconvert 4.1.0
matplotlib 1.5.1
patsy 0.4.1
tornado 4.3
python-dateutil 2.5.2
Jinja2 2.8
backports-abc 0.4
brewer2mpl 1.4.1
Pygments 2.1.3


asked Jun 13 '16 by dartdog


1 Answer

Google BigQuery authentication in pandas is normally straightforward, except when pandas code is executed on a remote server, for example when running pandas on Datalab in the cloud. In that case, use the following code to create the credentials file that pandas needs to access Google BigQuery from Google Datalab.

from oauth2client.client import OAuth2WebServerFlow
from oauth2client.file import Storage

# Build an OAuth2 flow for the BigQuery scope; the out-of-band redirect
# URI makes the flow work without a browser on the remote machine
flow = OAuth2WebServerFlow(client_id='<Client ID from Google API Console>',
                           client_secret='<Client secret from Google API Console>',
                           scope='https://www.googleapis.com/auth/bigquery',
                           redirect_uri='urn:ietf:wg:oauth:2.0:oob')

# pandas looks for this file in the notebook's working directory
storage = Storage('bigquery_credentials.dat')

# Print the consent URL; open it in any browser and paste the code back
authorize_url = flow.step1_get_authorize_url()
print 'Go to the following link in your browser: ' + authorize_url
code = raw_input('Enter verification code: ')

# Exchange the code for credentials and persist them for pandas to reuse
credentials = flow.step2_exchange(code)
storage.put(credentials)

Once you complete the process, I don't expect you will see the error (as long as the notebook is executed from the same folder as the newly created 'bigquery_credentials.dat' file).
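After that, a call like the following (a sketch reusing the placeholder project id from the question) should run without any authentication prompt:

import pandas as pd

# picks up 'bigquery_credentials.dat' from the working directory
df = pd.io.gbq.read_gbq('SELECT 1 AS test', project_id='xxxxxxx-xxxx')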

You also need to install the google-api-python-client Python package, as it is required by pandas for Google BigQuery support. You can run either of the following in a notebook to install it.

Either

!pip install google-api-python-client --no-deps
!pip install uritemplate --no-deps
!pip install simplejson --no-deps

or

%%bash
pip install google-api-python-client --no-deps
pip install uritemplate --no-deps
pip install simplejson --no-deps

The --no-deps option is needed so that you don't accidentally upgrade a Python package that is installed in Datalab by default (ensuring other parts of Datalab don't break).
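A quick sanity check after installing (assuming only that a successful import means the package is visible to the kernel):

# google-api-python-client installs the 'googleapiclient' module
# (older versions also expose an 'apiclient' alias)
import googleapiclient
print(googleapiclient.__version__)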

Note: With pandas 0.19.0 (not released yet), it will be much easier to use pandas in Google Cloud Datalab. See Pull Request #13608

Note: You also have the option to use the (new) google datalab module inside of Jupyter (that way the code will also work in Google Datalab on the cloud). See the following related Stack Overflow answer: How do I use gcp package from outside of google datalabs?
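For reference, a query through that module typically looks like this (a sketch assuming the google.datalab.bigquery API; the SQL is a placeholder):

import google.datalab.bigquery as bq

# run a query and pull the results into a pandas DataFrame
df = bq.Query('SELECT 1 AS test').execute().result().to_dataframe()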

answered by Anthonios Partheniou