Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Querying for N random records on Appengine datastore

I'm trying to write a GQL query that returns N random records of a specific kind. My current implementation works but requires N calls to the datastore. I'd like to make it 1 call to the datastore if possible.

I currently assign a random number to every kind that I put into the datastore. When I query for a random record I generate another random number and query for records > rand ORDER BY asc LIMIT 1.

This works, however, it only returns 1 record so I need to do N queries. Any ideas on how to make this one query? Thanks.

like image 576
aloo Avatar asked Jul 09 '09 16:07

aloo


3 Answers

"Under the hood" a single search query call can only return a set of consecutive rows from some index. This is why some GQL queries, including any use of !=, expand to multiple datastore calls.

N independent uniform random selections are not (in general) consecutive in any index.

QED.

You could probably use memcache to store the entities, and reduce the cost of grabbing N of them. Or if you don't mind the "random" selections being close together in the index, select a randomly-chosen block of (say) 100 in one query, then pick N at random from those. Since you have a field that's already randomised, it won't be immediately obvious to an outsider that the N items are related. At least, not until they look at a lot of samples and notice that items A and Z never appear in the same group, because they're more than 100 apart in the randomised index. And if performance permits, you can re-randomise your entities from time to time.

like image 59
Steve Jessop Avatar answered Oct 13 '22 19:10

Steve Jessop


What kind of tradeoffs are you looking for? If you are willing to put up with a small performance hit on inserting these entities, you can create a solution to get N of them very quickly.

Here's what you need to do:

When you insert your Entities, specify the key. You want to give keys to your entities in order, starting with 1 and going up from there. (This will require some effort, as app engine doesn't have autoincrement() so you'll need to keep track of the last id you used in some other entity, let's call it an IdGenerator)

Now when you need N random entities, generate N random numbers between 1 and whatever the last id you generated was (your IdGenerator will know this). You can then do a batch get by key using the N keys, which will only require one trip to the datastore, and will be faster than a query as well, since key gets are generally faster than queries, AFAIK.

This method does require dealing with a few annoying details:

  1. Your IdGenerator might become a bottleneck if you are inserting lots of these items on the fly (more than a few a second), which would require some kind of sharded IdGenerator implementation. If all this data is preloaded, or is not high volume, you have it easy.
  2. You might find that some Id doesn't actually have an entity associated with it anymore, because you deleted it or because a put() failed somewhere. If this happened you'd have to grab another random entity. (If you wanted to get fancy and reduce the odds of this you could make this Id available to the IdGenerator to reuse to "fill in the holes")

So the question comes down to how fast you need these N items vs how often you will be adding and deleting them, and whether a little extra complexity is worth that performance boost.

like image 5
Peter Recore Avatar answered Oct 13 '22 17:10

Peter Recore


Looks like the only method is by storing the random integer value in each entity's special property and querying on that. This can be done quite automatically if you just add an automatically initialized property.

Unfortunately this will require processing of all entities once if your datastore is already filled in.

It's weird, I know.

like image 3
Slava Tutushkin Avatar answered Oct 13 '22 19:10

Slava Tutushkin