Put entity consistently slow on Python Cloud Datastore

I am using Google Cloud Datastore through the Python client library in a Python 3 App Engine flexible environment. My Flask application creates an object and then adds it to Datastore with:

from google.cloud import datastore

ds = datastore.Client()
ds.put(entity)

In my testing, each call to put takes 0.5-1.5 seconds to complete. This does not change if I make two calls one immediately after the other.
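Roughly how I am measuring this (simplified; KIND and object_dictionary stand in for my actual kind name and payload, described below):

import time
from google.cloud import datastore

ds = datastore.Client()

for _ in range(2):
    entity = datastore.Entity(key=ds.key(KIND))
    entity.update(object_dictionary)
    start = time.perf_counter()
    ds.put(entity)  # each individual put takes 0.5-1.5 s
    print(f"put took {time.perf_counter() - start:.2f}s")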

I am wondering if the complexity of my object is the problem. It is multi-layered, something like:

object = {
    a: 1,
    ...,
    b: [
        {
            d: 2,
            ...,
            e: {
                h: 3
            }
        }
    ],
    c: [
        {
            f: 4,
            ...,
            g: {
                i: 5
            }
        }
    ]
}

which I am creating by nesting datastore.Entity objects, each initialised with something like:

entity = datastore.Entity(key=ds.key(KIND))
entity.update(object_dictionary)

Both lists are 3-4 items long. The JSON equivalent of the object is ~2-3 KB.
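To make the structure concrete, here is a simplified sketch of how one nested item is assembled (field names are placeholders for my real ones):

# Embedded entities are created without a key and attached as values
inner = datastore.Entity()
inner.update({"h": 3})

middle = datastore.Entity()
middle.update({"d": 2, "e": inner})

outer = datastore.Entity(key=ds.key(KIND))
outer.update({"a": 1, "b": [middle]})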

Is this not the recommended practice? What should I be doing instead?

More info:

I do not currently wrap this put in a transaction. put is just a thin wrapper around put_multi, which appears to create a batch, add the entity, and commit the batch.
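For reference, wrapping the write in a transaction (which, again, I am not doing) would look roughly like:

# The put is buffered by the transaction and committed when the block exits
with ds.transaction():
    ds.put(entity)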

I do not specify the object's "Name/ID" (as it is titled in the Datastore online console). I let the library decide that for me:

ds.key(KIND)

where KIND is just a string specifying my collection's name. The alternative would be:

ds.key(KIND, <some ID>)

which I use when updating objects, rather than here, where I am creating the object for the first time. The keys generated by the library increase over time, but not monotonically (e.g. id=4669294231158784, id=4686973524508672).
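Side by side, the two key forms look like this (the numeric ID is one of the example values above):

# Partial key: Datastore assigns the numeric ID at commit time
new_key = ds.key(KIND)

# Complete key: identifies an existing entity, used for updates
existing_key = ds.key(KIND, 4669294231158784)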

I am not 100% sure of the terminology for what I am doing ("are entities in the same entity group, or do you use indexed properties"), but people seem to refer to the process as using an "embedded entity" (e.g. here). In the Datastore online console, under the entities section, I only have a single "kind", not multiple kinds for each of my sub-objects. Does that answer your question, or can I find this out somehow?

I only have one index on the collection, on a separate ID field that references an object in a different database, for cross-database lookup.

asked Feb 12 '18 by Jon G

2 Answers

You can increase the performance of multiple consecutive writes (and reads) by using batch operations:

Batch operations

Cloud Datastore supports batch versions of the operations which allow you to operate on multiple objects in a single Cloud Datastore call.

Such batch calls are faster than making separate calls for each individual entity because they incur the overhead for only one service call. If multiple entity groups are involved, the work for all the groups is performed in parallel on the server side.

client.put_multi([task1, task2])
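A minimal sketch of a batched write (entity contents here are illustrative):

from google.cloud import datastore

client = datastore.Client()

task1 = datastore.Entity(key=client.key("Task"))
task1.update({"description": "first task"})

task2 = datastore.Entity(key=client.key("Task"))
task2.update({"description": "second task"})

# One service call commits both entities
client.put_multi([task1, task2])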
answered Nov 16 '22 by Dan Cornilescu


Aside from the batching recommendation in the other answer, there are other practices that can decrease your "put" time.

When you perform a "write" in Datastore, your data is actually written multiple times, to multiple tables (indexes), to speed up queries. Datastore is optimized for query-time performance at the cost of some write-time efficiency and storage. For example, if you index three normal fields, every write updates three (sorted) tables. Fields that will not be queried should not be indexed; this will save you both time and money.
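With the Python client, you can opt properties out of indexing when constructing the entity (the kind and property names below are illustrative):

from google.cloud import datastore

client = datastore.Client()

# Properties named in exclude_from_indexes skip the per-index table writes
entity = datastore.Entity(
    key=client.key("Task"),
    exclude_from_indexes=("description", "payload"),
)
entity.update({"description": "not queried", "payload": "large blob of text"})
client.put(entity)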

The effect of over-indexing is even worse when you have repeated or nested fields, because of the "exploding index" effect. Essentially, your data is "flattened" before it is stored, so having multiple repeated fields results in a multiplicative increase in write cost and time.

answered Nov 16 '22 by Ying Li