I have large CSV and Excel files. I read them and dynamically generate the CREATE TABLE script from the fields and types they contain, then insert the data into the created table.
I have read this and understood that I should send them with jobs.insert() instead of tabledata.insertAll() for large amounts of data.
This is how I call it (it works for smaller files, but not for large ones):
result = client.push_rows(datasetname,table_name,insertObject) # insertObject is a list of dictionaries
When I use the library's push_rows, it gives this error on Windows:
[Errno 10054] An existing connection was forcibly closed by the remote host
and this on Ubuntu:
[Errno 32] Broken pipe
When I went through the BigQuery-Python code, I found that it uses tabledata.insertAll().
How can I do this with this library? I know we can upload through Google Cloud Storage, but I need a direct upload method with this library.
If using Cloud Storage is an option, you can put all the files under a common prefix in a bucket and use a wildcard, e.g. gs://my_bucket/some/path/files*, to specify a single load job with multiple inputs.
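As a rough illustration of the wildcard approach, the request body for jobs.insert could be built like this. This is only a sketch: the helper name build_load_job_body and the project, dataset, and table names are made up for the example, and the choice of CSV with one header row and WRITE_APPEND is an assumption about the data.

```python
def build_load_job_body(project_id, dataset_id, table_id, source_uris,
                        skip_leading_rows=1):
    """Return a jobs.insert request body describing one CSV load job.

    The helper and its arguments are hypothetical; the keys follow the
    BigQuery REST API's load job configuration.
    """
    return {
        "configuration": {
            "load": {
                # A single wildcard URI can match many staged files.
                "sourceUris": list(source_uris),
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                "sourceFormat": "CSV",
                # Assumes each file starts with one header row.
                "skipLeadingRows": skip_leading_rows,
                "writeDisposition": "WRITE_APPEND",
            }
        }
    }


# Placeholder names, not values from the question.
body = build_load_job_body(
    "my-project", "my_dataset", "my_table",
    ["gs://my_bucket/some/path/files*"],
)
```

With google-api-python-client, a body like this would typically be submitted with something along the lines of service.jobs().insert(projectId="my-project", body=body).execute(), where service is an authorized BigQuery v2 client.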
When handling large files, don't use streaming; use a batch load instead. Streaming will easily handle up to 100,000 rows per second. That's pretty good for streaming, but not for loading large files.
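For inputs small enough that streaming is still acceptable, one workaround for the connection errors is to avoid sending the whole list in a single request. The sketch below splits the rows into batches and hands each batch to a callable. The helper names and the default batch size of 500 are assumptions for illustration, as is the idea that the push function returns a truthy value on success.

```python
def iter_batches(rows, batch_size=500):
    """Yield consecutive slices of rows, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def push_in_batches(push, rows, batch_size=500):
    """Call push(batch) per batch and return the number of rows sent.

    Stops at the first batch for which push returns a falsy value.
    """
    sent = 0
    for batch in iter_batches(rows, batch_size):
        if not push(batch):
            break
        sent += len(batch)
    return sent
```

With the call from the question, this might be used as push_in_batches(lambda batch: client.push_rows(datasetname, table_name, batch), insertObject). It only reduces the size of each request; for really large files a load job is still the better fit.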
The sample code linked is doing the right thing (batch instead of streaming), so what we see is a different problem: the sample code is trying to load all this data straight into BigQuery, but the upload through POST fails. gsutil has a more robust uploading algorithm than just a plain POST.
Solution: Instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read files from GCS.
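Since the rows in the question are already a list of dictionaries, one way to prepare them for staging is to write them out as newline-delimited JSON, a format BigQuery load jobs accept. This is a minimal sketch using only the standard library; the function name write_ndjson and the file name are made up for the example.

```python
import json


def write_ndjson(rows, path):
    """Write each dict in rows as one JSON object per line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            # One complete JSON object per line, no enclosing array.
            handle.write(json.dumps(row, ensure_ascii=False))
            handle.write("\n")
            count += 1
    return count
```

The resulting file could then be copied to the bucket, for example with gsutil cp rows.json gs://my_bucket/some/path/, and loaded with a job whose sourceFormat is NEWLINE_DELIMITED_JSON.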
See also BigQuery script failing for large file