Insert large amount of data to BigQuery via bigquery-python library

I have large CSV and Excel files. I read them and dynamically generate the required create-table script from the fields and types they contain, then insert the data into the created table.
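To illustrate the kind of dynamic schema generation described here, the following is a minimal sketch that guesses a column type for each CSV field from a sample of rows. The function names `infer_bq_type` and `infer_schema` are hypothetical, the type detection is deliberately crude (INTEGER, FLOAT, or STRING only), and the dictionary layout is assumed to follow the `name`/`type`/`mode` field format of the BigQuery REST API.

```python
import csv
import io
from itertools import islice


def infer_bq_type(values):
    """Return the narrowest of INTEGER, FLOAT, STRING that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "STRING"  # nothing to inspect, fall back to the most permissive type
    for bq_type, cast in (("INTEGER", int), ("FLOAT", float)):
        try:
            for v in non_empty:
                cast(v)
            return bq_type
        except ValueError:
            continue  # at least one value did not parse, try the next type
    return "STRING"


def infer_schema(csv_text, sample_size=1000):
    """Build a schema (list of field dicts) from the header and the first sample_size rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(islice(reader, sample_size))
    schema = []
    for i, name in enumerate(header):
        column = [row[i] for row in rows if i < len(row)]
        schema.append({"name": name, "type": infer_bq_type(column), "mode": "NULLABLE"})
    return schema


if __name__ == "__main__":
    sample = "id,price,name\n1,2.5,apple\n2,3,pear\n"
    for field in infer_schema(sample):
        print(field)
```

Sampling only the first rows keeps memory use bounded for large files, at the cost of possibly guessing a type that a later row violates.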

I have read this and understand that, for large amounts of data, I should send the rows with jobs.insert() instead of tabledata.insertAll().

This is how I call it (it works for smaller files, but not for large ones):

result = client.push_rows(datasetname, table_name, insertObject)  # insertObject is a list of dictionaries

When I use the library's push_rows, it gives this error on Windows:

[Errno 10054] An existing connection was forcibly closed by the remote host

and this one on Ubuntu:

[Errno 32] Broken pipe

When I went through the BigQuery-Python code, I found that push_rows uses tabledata.insertAll().

How can I do this with this library? I know we can upload through Google Cloud Storage, but I need a direct upload method with this library.

asked Aug 16 '16 09:08

Marlon Abeykoon

1 Answer

When handling large files, don't use streaming; use a batch load instead. Streaming will easily handle up to 100,000 rows per second. That's pretty good for streaming, but it is not the right tool for loading large files.

The sample code linked is doing the right thing (batch instead of streaming), so what we see is a different problem: the sample code tries to load all this data straight into BigQuery, but the upload through POST fails. gsutil has a more robust uploading algorithm than a plain POST.

Solution: Instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read files from GCS.
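The two steps in this solution could look roughly like the sketch below. Note that it uses Google's official `google-cloud-storage` and `google-cloud-bigquery` client libraries rather than the BigQuery-Python library from the question, and that it assumes a CSV file with a header row and schema auto-detection. The function names `gcs_uri` and `stage_and_load`, and all bucket, file, and table names, are hypothetical placeholders.

```python
def gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI that a BigQuery load job reads from."""
    return "gs://{}/{}".format(bucket_name, blob_name)


def stage_and_load(local_path, bucket_name, blob_name, table_id):
    """Upload a local CSV file to GCS, then batch-load it into BigQuery.

    table_id is assumed to be a "project.dataset.table" string.
    """
    # Imported lazily so gcs_uri() stays usable without the client libraries.
    from google.cloud import bigquery, storage

    # Step 1: stage the file in Google Cloud Storage.
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)

    # Step 2: tell BigQuery to read the staged file with a load job.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumption: the first row is a header
        autodetect=True,      # assumption: let BigQuery infer the schema
    )
    job = bigquery.Client().load_table_from_uri(
        gcs_uri(bucket_name, blob_name), table_id, job_config=job_config
    )
    job.result()  # block until the load job finishes; raises on failure
    return job.output_rows


if __name__ == "__main__":
    # Placeholder names; running this requires credentials and real resources.
    print(gcs_uri("my-bucket", "uploads/data.csv"))
```

Running `stage_and_load` requires valid Google Cloud credentials, an existing bucket, and an existing dataset.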

See also BigQuery script failing for large file
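If staging in GCS is not an option and streaming through push_rows has to stay for now, a possible mitigation is to send the rows in several smaller requests instead of one very large one. This is only a sketch: it assumes the connection errors come from the size of a single request, that `push_rows` returns a truthy value on success, and the helper names `chunked` and `push_in_chunks` as well as the default chunk size are illustrative choices, not part of the library.

```python
def chunked(rows, size):
    """Yield consecutive slices of rows, each with at most size items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def push_in_chunks(client, dataset, table, rows, chunk_size=500):
    """Stream rows in small batches; return the number of rows pushed.

    client is expected to expose push_rows(dataset, table, rows) as in
    the question. Assumption: a falsy return value signals a failed insert.
    """
    pushed = 0
    for chunk in chunked(rows, chunk_size):
        if not client.push_rows(dataset, table, chunk):
            raise RuntimeError("push_rows failed after {} rows".format(pushed))
        pushed += len(chunk)
    return pushed


if __name__ == "__main__":
    class FakeClient(object):
        """Stand-in for the real client so the sketch runs offline."""

        def push_rows(self, dataset, table, rows):
            print("pushing {} rows".format(len(rows)))
            return True

    sample_rows = [{"id": i} for i in range(1200)]
    print(push_in_chunks(FakeClient(), "my_dataset", "my_table", sample_rows))
```

This keeps each request small, but it is still streaming, so the batch-load advice above remains the better fit for very large files.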

answered Sep 30 '22 14:09

Felipe Hoffa

Tags: python, large-data, python-2.7, google-bigquery
