We have some part of our application that need to load a large set of data (>2000 entities) and perform computation on this set. The size of each entity is approximately 5 KB.
On our initial, naïve, implementation, the bottleneck seems to be the time required to load all the entities (~40 seconds for 2000 entities), while the time required to perform the computation itself is very small (<1 second).
We had tried several strategies to speed up the entities retrieval:
- Splitting the retrieval request into several parallel instances and then merging the result: ~20 seconds for 2000 entities.
- Storing the entities at an in-memory cache placed on a resident backend: ~5 seconds for 2000 entities.
The computation needs to be dynamically computed, so doing a precomputation at write time and storing the result does not work in our case.
We are hoping to be able to retrieve ~2000 entities in just under one second. Is this within the capability of GAE/J? Any other strategies that we might be able to implement for this kind of retrieval?
UPDATE: Supplying additional information about our use case and parallelization result:
- We have more than 200.000 entities of the same kind in the datastore and the operation is retrieval-only.
- We experimented with 10 parallel worker instances, and a typical result that we obtained could be seen in this pastebin. It seems that the serialization and deserialization required when transferring the entities back to the master instance hampers the performance.
UPDATE 2: Giving an example of what we are trying to do:
- Let's say that we have a StockDerivative entity that need to be analyzed to know whether it's a good investment or not.
- The analysis performed requires complex computations based on many factors both external (e.g. user's preference, market condition) and internal (i.e. from the entity's properties), and would output a single "investment score" value.
- The user could request the derivatives to be sorted based on its investment score and ask to be presented with N-number of highest-scored derivatives.
Ancestors and Entity Groups For highly related or hierarchical data, Datastore allows entities to be stored in a parent/child relationship. This is known as an entity group or ancestor/descendent relationship. This is an example of an entity group with kinds of types person, pet, and toy.
Each entity in a Datastore mode database has a key that uniquely identifies it. The key consists of the following components: The namespace of the entity, which allows for multitenancy. The kind of the entity, which categorizes it for the purpose of queries. An identifier for the individual entity, which can be either.
In Java, you create a new entity by constructing an instance of class Entity , supplying the entity's kind as an argument to the Entity() constructor. After populating the entity's properties if necessary, you save it to the datastore by passing it as an argument to the DatastoreService. put() method.
200.000 by 5kb is 1GB. You could keep all this in memory on the largest backend instance or have multiple instances. This would be the fastest solution - nothing beats memory.
Do you need the whole 5kb of each entity for computation? Do you need all 200k entities when querying before computation? Do queries touch all entities?
Also, check out BigQuery. It might suit your needs.
Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't you probably have to move to another platform.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With