How to retrieve huge (>2000) amount of entities from GAE datastore in under 1 second?

We have a part of our application that needs to load a large set of data (>2000 entities) and perform a computation on it. Each entity is approximately 5 KB in size.

In our initial, naïve implementation, the bottleneck is the time required to load all the entities (~40 seconds for 2000 entities), while the computation itself is very fast (<1 second).

We have tried several strategies to speed up entity retrieval:

  • Splitting the retrieval request into several parallel instances and then merging the result: ~20 seconds for 2000 entities.
  • Storing the entities in an in-memory cache on a resident backend: ~5 seconds for 2000 entities.
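The first strategy can be sketched with plain Java concurrency. Here `fetchChunk` is a hypothetical stand-in for the per-worker datastore fetch (in real code each worker would query its own key range); the point is splitting the load across a thread pool and merging the futures:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetch {
    // Hypothetical stand-in for fetching one slice of entities
    // (e.g. one key range) from the datastore.
    static List<String> fetchChunk(int chunkIndex, int chunkSize) {
        List<String> entities = new ArrayList<>();
        for (int i = 0; i < chunkSize; i++) {
            entities.add("entity-" + (chunkIndex * chunkSize + i));
        }
        return entities;
    }

    // Split the retrieval into `workers` parallel fetches and merge.
    public static List<String> fetchAll(int totalEntities, int workers)
            throws InterruptedException, ExecutionException {
        int chunkSize = (totalEntities + workers - 1) / workers;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int chunk = w;
            futures.add(pool.submit(() -> fetchChunk(chunk, chunkSize)));
        }
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            merged.addAll(f.get()); // blocks until that chunk is done
        }
        pool.shutdown();
        return merged;
    }
}
```

With 2000 entities and 10 workers, each worker fetches 200 entities and the results are merged in chunk order.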

The result must be computed dynamically at read time, so precomputing it at write time and storing the result does not work in our case.

We are hoping to retrieve ~2000 entities in under one second. Is this within the capabilities of GAE/J? Are there any other strategies we could implement for this kind of retrieval?

UPDATE: Supplying additional information about our use case and parallelization result:

  • We have more than 200,000 entities of the same kind in the datastore, and the operation is retrieval-only.
  • We experimented with 10 parallel worker instances; a typical result can be seen in this pastebin. It seems that the serialization and deserialization required to transfer the entities back to the master instance hampers performance.

UPDATE 2: Giving an example of what we are trying to do:

  1. Let's say we have a StockDerivative entity that needs to be analyzed to know whether it's a good investment or not.
  2. The analysis requires complex computations based on many factors, both external (e.g. the user's preferences, market conditions) and internal (i.e. the entity's properties), and outputs a single "investment score" value.
  3. The user can request that the derivatives be sorted by their investment score and ask to be presented with the N highest-scored derivatives.
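The scoring-and-top-N step above could be sketched as follows; the entity's properties and the score formula are purely hypothetical placeholders. A bounded min-heap keeps memory at O(N) even when scanning all 200,000 candidates:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopDerivatives {
    static class StockDerivative {
        final String name;
        final double volatility; // hypothetical internal property
        final double price;      // hypothetical internal property
        StockDerivative(String name, double volatility, double price) {
            this.name = name;
            this.volatility = volatility;
            this.price = price;
        }
    }

    // Hypothetical score: combines internal properties with an
    // external "market condition" factor supplied at request time.
    static double investmentScore(StockDerivative d, double marketFactor) {
        return marketFactor * d.price / (1.0 + d.volatility);
    }

    // Return the N highest-scored derivatives, highest first.
    static List<StockDerivative> topN(List<StockDerivative> all,
                                      double marketFactor, int n) {
        PriorityQueue<StockDerivative> heap = new PriorityQueue<>(
            Comparator.comparingDouble(d -> investmentScore(d, marketFactor)));
        for (StockDerivative d : all) {
            heap.offer(d);
            if (heap.size() > n) heap.poll(); // evict the current lowest score
        }
        List<StockDerivative> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble(
            (StockDerivative d) -> investmentScore(d, marketFactor)).reversed());
        return result;
    }
}
```

Because the external factors vary per request, the score cannot be indexed ahead of time, which is why the write-time precomputation mentioned earlier does not apply.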
Ibrahim Arief asked Jan 05 '12


2 Answers

200,000 entities × 5 KB is 1 GB. You could keep all of this in memory on the largest backend instance, or spread it across multiple instances. This would be the fastest solution: nothing beats memory.

Do you need the full 5 KB of each entity for the computation? Do you need all 200k entities when querying before the computation? Do the queries touch all entities?

Also, check out BigQuery. It might suit your needs.

Peter Knego answered Sep 29 '22


Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't, you probably have to move to another platform.
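A minimal sketch of the cache-aside pattern this suggests, using a `ConcurrentHashMap` as a stand-in so the example is self-contained; on GAE/J the cache would instead be `MemcacheServiceFactory.getMemcacheService()` with its `put`/`get` calls, and the loader would be the datastore fetch:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class EntityCache {
    // Stand-in for Memcache; real GAE code would go through
    // com.google.appengine.api.memcache.MemcacheServiceFactory.
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    // Cache-aside read: return the cached entity bytes if present,
    // otherwise load from the datastore and populate the cache.
    public byte[] get(String key, Function<String, byte[]> datastoreLoad) {
        return cache.computeIfAbsent(key, datastoreLoad);
    }
}
```

Note that Memcache entries can be evicted at any time, so the datastore loader must always remain the fallback path.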

Viruzzo answered Sep 29 '22