Django - short non-linear non-predictable ID in the URL

I know there are similar questions (like this, this, this and this), but I have specific requirements and am looking for a less expensive way to do the following (on Django 1.10.2):

I want to avoid sequential/guessable integer ids in the URLs and ideally meet the following requirements:

  • Avoid UUIDs since that makes the URL really long.
  • Avoid a custom primary key. It doesn’t seem to work well if the models have ManyToManyFields. I was affected by at least three bugs while trying that (#25012, #24030 and #22997), which messed up the migrations and forced me to delete the entire db and recreate the migrations (well, lots of good learning too).
  • Avoid checking for collisions if possible (hence avoid a db lookup for every insert)
  • Don’t just want to look up by the slug since it’s less performant than just looking up an integer id.
  • Don’t care too much about encrypting the id - just don’t want it to be a visibly sequential integer.

Note: The app would likely have 5 million records or so in the long term.

Asked by Anupam, Apr 06 '17 11:04



1 Answer

After researching a lot of options on SO, blogs etc., I ended up doing the following:

  • Encoding the id to base32 only for the URLs and decoding it back in urls.py (using an edited version of Django’s util functions to encode to base 36 since I needed uppercase letters instead of lowercase).
  • Not storing the encoded id anywhere. Just encoding and decoding every time on the fly.
  • Keeping the default id intact and using it as primary key.

(good hints, posts and especially this comment helped a lot)
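
The encode/decode step described above can be sketched roughly as follows. This is not the exact code from the answer. It is a minimal, framework-free illustration that assumes an uppercase base-36 alphabet, and the names `ALPHABET`, `encode_id` and `decode_id` are hypothetical. (Django itself ships lowercase helpers, `int_to_base36` and `base36_to_int`, in `django.utils.http`.)

```python
import string

# Digits 0-9 followed by uppercase A-Z: 36 symbols in total (assumed alphabet).
ALPHABET = string.digits + string.ascii_uppercase


def encode_id(number: int) -> str:
    """Encode a non-negative integer id as an uppercase base-36 string."""
    if number < 0:
        raise ValueError("id must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, 36)
        chars.append(ALPHABET[remainder])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


def decode_id(token: str) -> int:
    """Decode an uppercase base-36 string back to the integer id."""
    # Reject anything outside the alphabet (empty strings, signs, lowercase).
    if not token or any(ch not in ALPHABET for ch in token):
        raise ValueError("invalid encoded id")
    return int(token, 36)


print(encode_id(5000000))   # 2Z60W
print(decode_id("2Z60W"))   # 5000000
```

With this alphabet, an id around the 5 million mark still fits in five characters, since 36^5 is roughly 60 million. In a view, one would decode the URL token first and then look the object up by its integer primary key as usual.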

What this solution helps achieve:

  1. Absolutely no edits to models or post_save signals.
  2. No collision checks needed, which avoids one extra request to the db.
  3. Lookup still happens on the default id, which is fast. Also, no double save() calls on the model for every insert.
  4. Short and sweet encoded ID (the number of characters goes up as the number of records increases, but it is still not very long).

What it doesn’t help achieve/any drawbacks:

  1. Encryption - the ID is encoded but not encrypted, so the user may still be able to figure out the pattern to get to the id (but I don’t care about it much, as mentioned above).
  2. A tiny overhead of encoding and decoding on each URL construction/request but perhaps that’s better than collision checks and/or multiple save() calls on the model object for insertions.

For reference, I discovered along the way that there are multiple ways to generate random IDs (Django’s get_random_string, Python’s random, Django’s UUIDField etc.) and many ways to encode the current ID (base 36, base 62, XORing, and what not). The encoded ID can also be stored as another (indexed) field and looked up every time (like here), but whether that is acceptable depends on the performance parameters of the web app (since looking up a varchar id is less performant than looking up an integer id). This identifier field can be saved either from an overridden save() method on the model or by using a post_save signal (see here), although both approaches need save() to be called twice for every insert.
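
As an illustration of the XORing option mentioned above, the id can be XORed with a fixed constant before it is base-36 encoded. This is only a sketch under assumptions: `SECRET_MASK`, `obscure_id` and `reveal_id` are hypothetical names, and the mask value is arbitrary.

```python
import string

ALPHABET = string.digits + string.ascii_uppercase

# Hypothetical fixed constant; in a real project it might live in settings.
SECRET_MASK = 0x5A5A5A


def to_base36(number: int) -> str:
    """Render a non-negative integer in uppercase base 36."""
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, 36)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def obscure_id(pk: int) -> str:
    """XOR the primary key with the mask, then base-36 encode the result."""
    return to_base36(pk ^ SECRET_MASK)


def reveal_id(token: str) -> int:
    """Reverse obscure_id: decode base 36, then XOR with the same mask."""
    return int(token, 36) ^ SECRET_MASK


for pk in (1, 2, 3):
    token = obscure_id(pk)
    assert reveal_id(token) == pk
    print(pk, token)
```

XOR with a constant is a one-to-one mapping, so no collision check is needed and nothing extra has to be stored. It is obfuscation rather than encryption, though: neighbouring ids still differ only in their low bits, so their tokens tend to share a prefix.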

All ears to optimizations to the above approach. I love SO and the community. Every time there’s so much to learn here.

Update: More than a year after this post, I found a great library called hashids which does pretty much the same thing quite well! It’s available in many languages, including Python.

Answered by Anupam, Nov 08 '22 07:11
