
App Engine High Replication Datastore

I'm a total App Engine newbie, and I want to confirm my understanding of the high replication datastore.

The documentation says that entity groups are a "unit of consistency", and that all data is eventually consistent. Along the same lines, it also says "queries across entity groups can be stale".

Can someone provide some examples where queries can be "stale"? Is it saying I could potentially save an entity without any parent (i.e. in its own group), then query for it very soon after and not find it? Does it also imply that if I want data to always be 100% up-to-date, I need to save it all in the same entity group?

Is the common workaround for this to use memcache to cache entities for a period of time longer than the average time it takes for data to become consistent across all data centers? What's the ballpark latency for that?

Thanks

amatsukawa asked May 30 '11 07:05


2 Answers

Is it saying I could potentially save an entity without any parent (i.e. in its own group), then query for it very soon after and not find it?

Correct. Technically, this is the case for the regular Master-Slave datastore too, as indexes are updated asynchronously, but in practice the window of time in which that could happen is so small that you never see it.

If by "query" you mean "do a get by key", though, that will always return strongly consistent results in either implementation.

Does it also imply that if I want data to be always 100% up-to-date I need to save them all in the same entity group?

You'll need to define what you mean by "100% up-to-date" before it's possible to answer that.

Is the common workaround for this to use memcache to cache entities for a period of time longer than the average time it takes for data to become consistent across all data centers?

No. Memcache is strictly for improving access times; you shouldn't use it in any situation where cache eviction will cause trouble.
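To illustrate what "strictly for improving access times" means, here is a minimal sketch of a read-through cache. It is a toy: plain dicts stand in for both memcache and the datastore, and the names `cached_get`, `load_from_datastore`, and `datastore_reads` are hypothetical, not part of any App Engine API. The point it tries to show is that eviction costs only an extra read and never changes the answer.

```python
# Toy stand-ins: one dict plays the cache, another plays the datastore.
# Neither is the App Engine memcache or datastore API.
datastore = {"post1": "first article"}
cache = {}
datastore_reads = []  # records every key that had to be loaded from the store


def load_from_datastore(key):
    """Authoritative read; slower in a real system, but always correct."""
    datastore_reads.append(key)
    return datastore[key]


def cached_get(key):
    """Return the value for key, using the cache only as a shortcut."""
    if key in cache:
        return cache[key]
    value = load_from_datastore(key)
    cache[key] = value
    return value


first = cached_get("post1")   # miss: reads the store, then fills the cache
second = cached_get("post1")  # hit: served from the cache
cache.clear()                 # simulate eviction at an arbitrary moment
third = cached_get("post1")   # miss again, but the answer is unchanged
```

If correctness depended on the cached copy surviving, the `cache.clear()` step would break the application; here it only triggers one more datastore read.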

Strongly consistent gets are always available to you if you need to guarantee that you're seeing the latest version. Without a concrete example of what you're trying to do, though, it's difficult to provide a recommendation.

Nick Johnson answered Oct 16 '22 10:10


Obligatory blog example setup: Authors have Posts.

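from google.appengine.ext import db

class Author(db.Model):
    name = db.StringProperty()

class Post(db.Model):
    # Reference to the Author who wrote the post
    author = db.ReferenceProperty(Author)
    article = db.TextProperty()

# No parent is given, so bob is the root of his own entity group
bob = Author(name='bob')
bob.put()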

The first thing to remember is that a regular get/put/delete on a single entity group (including a single entity) will work as expected:

post1 = Post(article='first article', author=bob)
post1.put()

fetched_post = Post.get(post1.key())
# fetched_post is latest post1

You will only be able to notice inconsistency if you start querying across multiple entity groups. Unless you have specified a parent attribute, all your entities are in separate entity groups. So if it is important that bob can see his own post straight after creating it, we should be careful with the following:

# x is the maximum number of results to fetch
fetched_posts = Post.all().filter('author =', bob).fetch(x)
# fetched_posts _might_ contain latest post1

fetched_posts might contain the latest post1 from bob, but it might not. This is because the Posts are not all in the same entity group. When querying like this in HR, you should think "fetch me probably the latest posts for bob".

Since it is important in our application that the author can see his post in the list straight after creating it, we will use the parent attribute to tie them together, and use an ancestor query to fetch the posts only from within that group:

post2 = Post(parent=bob, article='second article', author=bob)
post2.put()

bobs_posts = Post.all().ancestor(bob.key()).filter('author =', bob).fetch(x)

Now we know that post2 will be in our bobs_posts results.
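The difference between the three kinds of read can be sketched with a toy in-memory model. This is not App Engine code and does not reflect how the datastore is implemented; `ToyStore` and its methods are hypothetical names, and the lagging index is simulated by an explicit `catch_up()` call. It only mirrors the behaviour described above: gets by key and ancestor queries see a write immediately, while a non-ancestor query may not.

```python
class ToyStore:
    """In-memory toy: entities are written at once, the global index lags."""

    def __init__(self):
        self.entities = {}      # key -> (parent_key, data); always current
        self.global_index = []  # keys visible to non-ancestor queries
        self.pending = []       # keys written but not yet indexed globally

    def put(self, key, data, parent=None):
        self.entities[key] = (parent, data)
        self.pending.append(key)

    def get(self, key):
        """Get by key: reads the entity itself, so it is never stale."""
        return self.entities[key][1]

    def query(self):
        """Non-ancestor query: sees only what the global index has caught up on."""
        return [self.entities[k][1] for k in self.global_index]

    def ancestor_query(self, parent):
        """Ancestor query: scoped to one entity group, so it sees every write."""
        return [data for (p, data) in self.entities.values() if p == parent]

    def catch_up(self):
        """Simulate the asynchronous index update finishing."""
        self.global_index.extend(self.pending)
        self.pending = []


store = ToyStore()
store.put("post2", "second article", parent="bob")

by_key = store.get("post2")            # sees the write
stale = store.query()                  # index has not caught up yet
in_group = store.ancestor_query("bob") # sees the write
store.catch_up()
eventual = store.query()               # now the global query sees it too
```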

If the aim of our query was to fetch "probably all the latest posts + definitely latest posts by bob" we would need to do another query.

other_posts = Post.all().fetch(x)

Then merge other_posts and bobs_posts together to get the desired result. Since post2 may appear in both lists, drop duplicates when merging.
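One way the merge could look, as a minimal sketch: the function below puts bob's posts first and appends any other post whose key has not been seen yet. `merge_posts` is a hypothetical helper, and the example uses plain dicts with a made-up "key" field so that it runs without the App Engine SDK; with real models one would presumably pass something like `key=lambda post: post.key()` instead.

```python
def merge_posts(bobs_posts, other_posts, key=lambda post: post["key"]):
    """Return bobs_posts followed by any other_posts not already present."""
    seen = set()
    merged = []
    for post in list(bobs_posts) + list(other_posts):
        k = key(post)
        if k not in seen:  # skip a post that both queries returned
            seen.add(k)
            merged.append(post)
    return merged


# Illustrative data only: post "p2" is returned by both queries.
bobs_posts = [{"key": "p2", "article": "second article"}]
other_posts = [
    {"key": "p1", "article": "first article"},
    {"key": "p2", "article": "second article"},
]

merged = merge_posts(bobs_posts, other_posts)
merged_keys = [post["key"] for post in merged]
```

Putting the ancestor-query results first means that when a post appears in both lists, the copy from the strongly consistent query is the one that is kept.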

Chris Farmiloe answered Oct 16 '22 12:10
