We are migrating some data from our production database and would like to archive most of this data in the Cloud Datastore.
Eventually we would like to move all our data there, but initially we are focusing on the archived data as a test.
Our language of choice is Python, and we have been able to transfer data from MySQL to Datastore row by row.
We have approximately 120 million rows to transfer, and moving them one row at a time will take a very long time.
Has anyone found any documentation or examples on how to bulk insert data into Cloud Datastore using Python?
Any comments or suggestions are appreciated. Thank you in advance.
Note: because Cloud Datastore API v1 has been released, Cloud Datastore API v1beta3 is now deprecated.
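For reference, the v1 API is what the google-cloud-datastore Python package talks to. A minimal single-entity write with it looks roughly like the sketch below; the project ID, kind, and property names are placeholders, not anything from the question.

```python
# Minimal sketch, assuming `pip install google-cloud-datastore`
# (the client library that targets the v1 API). All names are placeholders.
from google.cloud import datastore

client = datastore.Client(project='my-project')

key = client.key('ArchivedRow', 12345)       # kind + numeric id
entity = datastore.Entity(key=key)
entity['payload'] = '{"example": true}'      # arbitrary property value
client.put(entity)                           # single-entity write
```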
There is no "bulk-loading" feature for Cloud Datastore that I know of today, so if you're expecting something like "upload a file with all your data and it'll appear in Datastore", I don't think you'll find anything.
You could always write a quick script using a local queue that parallelizes the work.
The basic gist would be: read the rows out of MySQL, push them onto an in-memory queue, and spin up a pool of worker threads that each pull a batch off the queue and write it to Datastore in a single batched commit.
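Here is a minimal sketch of that approach, assuming the google-cloud-datastore client library, a kind named ArchivedRow, and a hypothetical fetch_rows() generator that streams rows out of MySQL as dicts with an id field. Treat it as a starting point to adapt, not a drop-in solution.

```python
import json
import queue
import threading

from google.cloud import datastore

BATCH_SIZE = 500     # Datastore accepts up to 500 entities per commit
NUM_WORKERS = 32     # tune based on observed throughput

work_queue = queue.Queue(maxsize=100)   # batches of rows waiting to be written


def worker():
    client = datastore.Client()         # give each thread its own client
    while True:
        rows = work_queue.get()
        if rows is None:                # sentinel value: no more work
            work_queue.task_done()
            return
        entities = []
        for row in rows:
            key = client.key('ArchivedRow', row['id'])
            entity = datastore.Entity(key=key, exclude_from_indexes=('payload',))
            entity['payload'] = json.dumps(row)   # whole row as one unindexed blob
            entities.append(entity)
        client.put_multi(entities)      # one batched write per BATCH_SIZE rows
        work_queue.task_done()


threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

batch = []
for row in fetch_rows():                # hypothetical generator reading from MySQL
    batch.append(row)
    if len(batch) == BATCH_SIZE:
        work_queue.put(batch)
        batch = []
if batch:
    work_queue.put(batch)

for _ in threads:
    work_queue.put(None)                # one sentinel per worker so they all exit
work_queue.join()
```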
Datastore is massively parallelizable, so if you can write a script that will send off thousands of writes per second, it should work just fine. Further, your big bottleneck here will be network IO (after you send a request, you have to wait a bit to get a response), so lots of threads should get a pretty good overall write rate. However, it'll be up to you to make sure you split the work up appropriately among those threads.
Now, that said, you should investigate whether Cloud Datastore is the right fit for your data and durability/availability needs. If you're taking 120m rows and loading them into Cloud Datastore for key-value style querying (aka, you have a key and an unindexed value property which is just JSON data), then this might make sense, but loading your data will cost you ~$70 in this case (120m * $0.06/100k).
If you have properties (which will be indexed by default), this cost goes up substantially.
The cost of operations is $0.06 per 100k, but a single "write" may contain several "operations". For example, let's assume you have 120m rows in a table that has 5 columns (which equates to one Kind with 5 properties).
A single "new entity write" is equivalent to:
So your actual cost to load this data is:
120m entities * 12 ops/entity * ($0.06/100k ops) = $864.00
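As a quick sanity check on that arithmetic, here is the same calculation spelled out in Python, using only the per-operation price quoted above:

```python
entities = 120_000_000
ops_per_entity = 2 + 2 * 5            # 2 for the entity + 2 per indexed property
price_per_op = 0.06 / 100_000         # $0.06 per 100k operations
total_cost = entities * ops_per_entity * price_per_op
print(ops_per_entity, total_cost)     # 12 ops/entity, 864.0 dollars
```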