Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why would you append a shard ID to a generated ID?

I'm reading this: https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c

In in the last section "Solution", where they are generating a globally unique ID based on the DB's autoincrement feature + milliseconds since epoch + shard ID.

Why do we need to append shard ID to it?

Specifically, it says

Next, we take the shard ID for this particular piece of data we’re trying to insert. Let’s say we’re sharding by user ID, and there are 2000 logical shards; if our user ID is 31341, then the shard ID is 31341 % 2000 -> 1341. We fill the next 13 bits with this value

THis doesn't make sense: if you are already modding user ID by number of shards (31341 % 2000), that means 1) You already have user ID! 2) You already know the shard it belongs to with the mod function!

What am I misunderstanding here?

like image 308
user1008636 Avatar asked Apr 27 '19 17:04

user1008636


People also ask

What does it mean to shard a database and why would you do it?

Sharding is a type of database partitioning that separates large databases into smaller, faster, more easily managed parts. These smaller parts are called data shards. The word shard means "a small part of a whole."

When should you shard a database?

Sharding is a great solution when the single database of your application is not capable to handle/store a huge amount of growing data. Sharding helps to scale the database and improve the performance of the application.

How does Instagram shard data?

To make sure all of our important data fits into memory and is available quickly for our users, we've begun to shard our data — in other words, place the data in many smaller buckets, each holding a part of the data. Our application servers run Django with PostgreSQL as our back-end database.

What is sharding and how is it implemented?

Oftentimes, sharding is implemented at the application level, meaning that the application includes code that defines which shard to transmit reads and writes to. However, some database management systems have sharding capabilities built in, allowing you to implement sharding directly at the database level.

Is it possible to modify the next_shard_ID to work like Instagram ID generator?

Modifying it to work exactly like the instagram id generator would involve updating the MAX_* constants and the bit shifting logic in next_shard_id and shard_id_to_ms. Thanks for contributing an answer to Stack Overflow!

How do I implement directory based sharding?

To implement directory based sharding, one must create and maintain a lookup table that uses a shard key to keep track of which shard holds which data. A lookup table is a table that holds a static set of information about where specific data can be found. The following diagram shows a simplistic example of directory based sharding:

What is a hash function in sharding?

A hash function is a function that takes as input a piece of data (for example, a customer email) and outputs a discrete value, known as a hash value. In the case of sharding, the hash value is a shard ID used to determine which shard the incoming data will be stored on. Altogether, the process looks like this:


1 Answers

Maybe I can break it down for you a bit better, and it's not just because user-id wont fit.

They're using Twitter Snowflake ID. This was designed to generate a unique ID across multiple servers, across multiple data centers, in a parallel. For instance, at the same exact instant two "items" in two "places" need a guaranteed unique ID for anything at the same instant less than a millisecond apart, maybe even at the same nanosecond... This unique ID has the requirements of needing to be being extremely fast to produce, efficient, built in a logical way that can be parsed efficiently, can fit within 64 bits, and the method of generating it needs to be able to handle a HUGE amount if IDs over many peoples lifetimes. This means they cannot do DB lookups to get a unique ID that's not already taken, the can't verify that the generated ID is unique after generating it to be sure, and they couldn't use existing methods that could possibly generate duplicates even if very rarely like UUID. So they devised a way..

They set a custom common epoch, such at today in a long integer as a base point. So with this they have a 42 bit long integer that starts at 0+time since that epoch.

Then they also added a sequence as a 12 bit long integer in the case that a single process on a single machine had to generate 2 or more IDs in the same millisecond. Now they have 42+12=54 bits in use, and when your considering that multiple processes on multiple machines (normally only one machine per data center providing IDs, but could be more, and normally only one worker/process per machine) you realize that you need more than just 42+12..

So they also have to encode a data center ID and a "worker" (process) ID. This will cover multiple data centers with multiple workers in each data center. These two IDs are both 5 bit long integers. All these integers are unsigned, so these 5 bit integers can go up to 31 which give each of these partial IDs 32 possibilities including 0. So, 32 data centers, with up to 32 workers in each datacenter.. So now we're at 42+12+5+5=64bits, with up to 32x32=1024 workers producing these IDs distributed.

So.. With a lifetime up to 139 years of being able to fit in the 42 bit portion... 10 bits for a node ID (or data center+worker IDs)... a sequence of 12 bits (4096 IDs per millisecond per worker)... You come up with a 64 maximum guaranteed unique ID system/formula that scales amazingly well over those 139 years that doesn't rely on a database in any way but can be efficiently produced and stored in a database.

So, this ID system works out to 42+12+10 and you can divide those 10 bits up, or not, however you like and not go beyond storing a 64bit unsigned long integer anywhere. Very flexible, and works great.

Again, it's called a Snowflake ID and Twitter came up with it. Those 10 bits can be called a shard ID, node ID, or a combination of data center ID and worker ID, it really depends on your needs. But, by not tying that shard/node ID to a user but to multiple processes and being able to use that ID across multiple "things", you wont have to worry about a lot of things and you can span multiple databases full of multiple things and and and..

The one thing that does matter is that that shard/node ID can only hold 1024 different values and no user ID or any unique ID that they could use is just going to go from 0 to 1023 in they don't assign it themselves to whatever.

So you see, those 10 bits have to be something that's static, assignable and easily parse-able for them regardless.

Here's a simply python function that'll generate a snowflake ID:

def genSnowflakeId(worker_id, data_center_id, ids_generated):
    "Returns a snowflake ID - This function will generate a unique ID that fits in a 64 bit unsigned number that scales for multiple workers running in mutiple datacenters. You must manage a timestamp and sequence sanity with ids_generated (i.e. increment if time apart < 1 millisecond or always increment and roll over to 0 if > 4095). Ultimately this will allow you to efficiently generate unique IDs across multiple locations for 139 years that fits in a bigint(20) database field and can be parsed for the created timestamp, worker ID, and datacenter ID. See https://github.com/twitter-archive/snowflake/tree/snowflake-2010"

    import sys
    import time

    # Mon Jul  8 05:07:56 EDT 2019
    twepoch = 1562576876131L

    sequence = 0L
    worker_id_bits = 5L
    data_center_id_bits = 5L
    sequence_bits = 12L
    timestamp_bits = 42L
    #total bits 64

    max_worker_id = -1L ^ (-1L << worker_id_bits)
    max_data_center_id = -1L ^ (-1L << data_center_id_bits)
    max_ids_generated = -1L ^ (-1L << sequence_bits)

    worker_id_shift = sequence_bits
    data_center_id_shift = sequence_bits + worker_id_bits
    timestamp_left_shift = sequence_bits + worker_id_bits + data_center_id_bits
    sequence_mask = -1L ^ (-1L << sequence_bits)


    # Sanity checks for input
    if worker_id > max_worker_id or worker_id < 0:
        raise ValueError("worker_id", "worker id can't be greater than %i or less than 0" % max_worker_id)
    if data_center_id > max_data_center_id or data_center_id < 0:
        raise ValueError("data_center_id", "data center id can't be greater than %i or less than 0" % max_data_center_id)
    if ids_generated > max_ids_generated or ids_generated < 0:
        raise ValueError("ids_generated", "ids generated can't be greater than %i or less than 0" % max_ids_generated)

    timestamp = long(int(time.time() * 1000))

    new_id = ((timestamp - twepoch) << timestamp_left_shift) | (data_center_id << data_center_id_shift) | (worker_id << worker_id_shift) | sequence

    return new_id

Hope this answer satisfies ya :)

like image 126
J T Avatar answered Nov 15 '22 05:11

J T