Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting entities and filtering ListProperty without incurring in exploding indexes

I'm developing a simple Blogging/Bookmarking platform and I'm trying to add a tags-explorer/drill-down feature a là delicious to allow users to filter the posts specifying a list of specific tags.

Something like this: enter image description here

Posts are represented in the datastore with this simplified model:

class Post(db.Model):
    title = db.StringProperty(required = True)
    link = db.LinkProperty(required = True)
    description = db.StringProperty(required = True)
    tags = db.ListProperty(str)
    created = db.DateTimeProperty(required = True, auto_now_add = True)

Post's tags are stored in a ListProperty and, in order to retrieve the list of posts tagged with a specific list of tags, the Post model exposes the following static method:

@staticmethod
def get_posts(limit, offset, tags_filter = []):
        posts = Post.all()
        for tag in tags_filter:
          if tag:
              posts.filter('tags', tag)
        return posts.fetch(limit = limit, offset = offset)

This works well, although I've not stressed it too much.

The problem raises when I try to add a "sorting" order to the get_posts method to keep the result ordered by "-created" date:

@staticmethod
def get_posts(limit, offset, tags_filter = []):
        posts = Post.all()
        for tag in tags_filter:
          if tag:
              posts.filter('tags', tag)
        posts.order("-created")
        return posts.fetch(limit = limit, offset = offset)

The sorting order adds an index for each tag to filter, leading to the dreaded exploding indexes problem.
One last thing that makes this thing more complicated is that the get_posts method should provide some pagination mechanism.

Do you know any Strategy/Idea/Workaround/Hack to solve this problem?

like image 268
systempuntoout Avatar asked May 24 '11 09:05

systempuntoout


2 Answers

Queries involving keys use indexes just like queries involving properties. Queries on keys require custom indexes in the same cases as with properties, with a couple of exceptions: inequality filters or an ascending sort order on key do not require a custom index, but a descending sort order on Entity.KEY_RESERVED_PROPERTY_key_ does.

So use a sortable date string for the primary key of the entity:

class Post(db.Model):
    title = db.StringProperty(required = True)
    link = db.LinkProperty(required = True)
    description = db.StringProperty(required = True)
    tags = db.ListProperty(str)
    created = db.DateTimeProperty(required = True, auto_now_add = True)

    @classmethod
    def create(*args, **kw):
         kw.update(dict(key_name=inverse_millisecond_str() + disambig_chars()))
         return Post(*args, **kw)

...

def inverse_microsecond_str(): #gives string of 8 characters from ascii 23 to 'z' which sorts in reverse temporal order
    t = datetime.datetime.now()
    inv_us = int(1e16 - (time.mktime(t.timetuple()) * 1e6 + t.microsecond)) #no y2k for >100 yrs
    base_100_chars = []
    while inv_us:
        digit, inv_us = inv_us % 100, inv_us / 100
        base_100_str = [chr(23 + digit)] + base_100_chars
    return "".join(base_100_chars)

Now, you don't even have to include a sort order in your queries, although it won't hurt to explicitly sort by key.

Things to remember:

  • This won't work unless you use the "create" here for all your Posts.
  • You'll have to migrate old data
  • No ancestors allowed.
  • The key is stored once per index, so it is worthwhile to keep it short; that's why I'm doing the base-100 encoding above.
  • This is not 100% reliable because of the possibility of key collisions. The above code, without disambig_chars, nominally gives reliability of the number of microseconds between transactions, so if you had 10 posts per second at peak times, it would fail 1/100,000. However, I'd shave off a couple orders of magnitude for possible app engine clock tick issues, so I'd actually only trust it for 1/1000. If that's not good enough, add disambig_chars; and if you need 100% reliability, then you probably shouldn't be on app engine, but I guess you could include logic to handle key collisions on save().
like image 94
Jameson Quinn Avatar answered Nov 14 '22 23:11

Jameson Quinn


What if you inverted the relationship? Instead of a post with a list of tags you would have a tag entity with a list of posts.

class Tag(db.Model):
  tag = db.StringProperty()
  posts = db.ListProperty(db.Key, indexed=False)

To search for tags you would do tags = Tag.all().filter('tag IN', ['python','blog','async'])

This would give you hopefully 3 or more Tag entities, each with a list of posts that are using that tag. You could then do post_union = set(tags[0].posts).intersection(tags[1].posts, tags[2].posts) to find the set of posts that have all tags.

Then you could fetch those posts and order them by created (I think). Posts.all().filter('__key__ IN', post_union).order("-created")

Note: This code is off the top of my head, I can't remember if you can manipulate sets like that.

Edit: @Yasser pointed out that you can only do IN queries for < 30 items.

Instead you could have the key name for each post start with the creation time. Then you could sort the keys you retrieved via the first query and just do Posts.get(sorted_posts).

Don't know how this would scale to a system with millions of posts and/or tags.

Edit2: I meant set intersection, not union.

like image 43
Calvin Avatar answered Nov 15 '22 00:11

Calvin