Why would you append a shard ID to a generated ID?

I'm reading this: https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c

In the last section, "Solution", they generate a globally unique ID from the DB's autoincrement feature + milliseconds since epoch + shard ID.

Why do we need to append shard ID to it?

Specifically, it says

Next, we take the shard ID for this particular piece of data we’re trying to insert. Let’s say we’re sharding by user ID, and there are 2000 logical shards; if our user ID is 31341, then the shard ID is 31341 % 2000 -> 1341. We fill the next 13 bits with this value

This doesn't make sense: if you are already modding the user ID by the number of shards (31341 % 2000), that means 1) you already have the user ID, and 2) you already know the shard it belongs to from the mod function!

What am I misunderstanding here?


asked Apr 27 '19 17:04

user1008636

1 Answer

Maybe I can break it down for you a bit better, and it's not just because the user ID won't fit.

They're using a variation of the Twitter Snowflake ID. Snowflake was designed to generate unique IDs across multiple servers, in multiple data centers, in parallel. For instance, two "items" in two "places" may need guaranteed-unique IDs at the same instant, less than a millisecond apart, maybe even in the same nanosecond... This unique ID has to be extremely fast to produce, built in a logical way that can be parsed efficiently, and fit within 64 bits, and the method of generating it has to handle a HUGE number of IDs over many people's lifetimes. This means they can't do DB lookups to find an ID that's not already taken, they can't check after the fact that a generated ID is unique, and they couldn't use existing methods that could possibly generate duplicates, even if very rarely, like UUID. So they devised a way...

They set a custom common epoch (for example, today's date as a long integer) as a base point. That gives them a 42-bit integer that starts at 0 and counts the milliseconds since that epoch.

Then they added a sequence as a 12-bit integer, in case a single process on a single machine has to generate 2 or more IDs in the same millisecond. Now they have 42+12=54 bits in use, and when you consider that multiple processes on multiple machines are generating IDs (normally only one machine per data center provides IDs, but there could be more, and normally only one worker/process per machine), you realize that you need more than just 42+12...
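Here is a minimal sketch of how that per-millisecond sequence could be managed inside one process. The class name SequenceCounter and its injectable clock parameter are made up for illustration and are not taken from Twitter's code; the sketch is also not thread-safe, so a real worker would need a lock around next().

```python
import time


class SequenceCounter:
    """Hands out (timestamp_ms, sequence) pairs that never repeat in one process."""

    SEQUENCE_BITS = 12
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1  # 4095

    def __init__(self, clock=lambda: int(time.time() * 1000)):
        self._clock = clock   # injectable so the behaviour can be tested
        self._last_ms = -1
        self._sequence = 0

    def next(self):
        now = self._clock()
        if now < self._last_ms:
            # Clock moved backwards: refuse rather than risk a duplicate.
            raise RuntimeError("clock moved backwards")
        if now == self._last_ms:
            if self._sequence == self.MAX_SEQUENCE:
                # Sequence used up for this millisecond: wait for the next one.
                while now <= self._last_ms:
                    now = self._clock()
                self._sequence = 0
            else:
                self._sequence += 1
        else:
            # New millisecond: start counting from 0 again.
            self._sequence = 0
        self._last_ms = now
        return now, self._sequence
```

Each call returns the millisecond timestamp together with the sequence number to pack into the low 12 bits, so two IDs made in the same millisecond by the same worker still differ.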

So they also have to encode a data center ID and a "worker" (process) ID. This covers multiple data centers with multiple workers in each data center. These two IDs are both 5-bit integers. All these integers are unsigned, so each 5-bit integer can go up to 31, which gives each of these partial IDs 32 possibilities including 0. So, 32 data centers, with up to 32 workers in each data center. Now we're at 42+12+5+5=64 bits, with up to 32x32=1024 distributed workers producing these IDs.

So... With a lifetime of up to 139 years fitting in the 42-bit portion... 10 bits for a node ID (or data center + worker IDs)... a sequence of 12 bits (4096 IDs per millisecond per worker)... you come up with a 64-bit, guaranteed-unique ID system/formula that scales amazingly well over those 139 years, doesn't rely on a database in any way, and can be efficiently produced and stored in a database.
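The numbers above can be checked with a few lines of arithmetic. This is only a sanity check of the 42+10+12 layout described in this answer; the constant names are mine, and a year is taken as 365.25 days.

```python
TIMESTAMP_BITS = 42
NODE_BITS = 10        # data center ID (5) + worker ID (5)
SEQUENCE_BITS = 12

MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25

# How long a millisecond counter of this width lasts before overflowing.
lifetime_years = (1 << TIMESTAMP_BITS) / MS_PER_YEAR
workers = 1 << NODE_BITS
ids_per_ms_per_worker = 1 << SEQUENCE_BITS

print(TIMESTAMP_BITS + NODE_BITS + SEQUENCE_BITS)  # 64
print(int(lifetime_years))                         # 139
print(workers)                                     # 1024
print(ids_per_ms_per_worker)                       # 4096
```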

So, this ID system works out to 42+12+10, and you can divide those 10 bits up, or not, however you like and never go beyond storing a 64-bit unsigned long integer anywhere. Very flexible, and works great.

Again, it's called a Snowflake ID and Twitter came up with it. Those 10 bits can be called a shard ID, a node ID, or a combination of data center ID and worker ID; it really depends on your needs. But by tying that shard/node ID to processes rather than to a user, and being able to use it across multiple "things", you won't have to worry about a lot of things, and you can span multiple databases full of multiple things, and so on.

The one thing that does matter is that the shard/node ID can only hold 1024 different values, and no user ID or other unique ID they could use is going to stay within 0 to 1023 unless they assign it themselves.

So you see, those 10 bits have to be something that's static, assignable and easily parse-able for them regardless.
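To make the size constraint concrete, here is a small sketch using the numbers quoted in the question (user ID 31341, 2000 logical shards, a 13-bit field). The helper name logical_shard_id is hypothetical. It shows that the raw user ID is too large for the field while the modulo result fits. Note that 2000 shards need the 13-bit field from the quoted article; they would not fit in the 10-bit node field of the layout described in this answer.

```python
def logical_shard_id(user_id, num_shards, field_bits):
    """Map a user ID to a logical shard ID, checking the shard count fits the field."""
    if num_shards > (1 << field_bits):
        raise ValueError("%d shards do not fit in %d bits" % (num_shards, field_bits))
    return user_id % num_shards


shard = logical_shard_id(31341, num_shards=2000, field_bits=13)
print(shard)              # 1341
print(31341 < (1 << 13))  # False: the raw user ID needs more than 13 bits
print(shard < (1 << 13))  # True: the shard ID fits
```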

Here's a simple Python function that'll generate a snowflake ID:

import time


def genSnowflakeId(worker_id, data_center_id, ids_generated):
    """Returns a snowflake ID.

    This function generates a unique ID that fits in a 64-bit unsigned number
    and scales to multiple workers running in multiple datacenters. You must
    manage timestamp and sequence sanity with ids_generated (i.e. increment if
    the time apart is < 1 millisecond, or always increment and roll over to 0
    if > 4095). Ultimately this allows you to efficiently generate unique IDs
    across multiple locations for 139 years that fit in a bigint(20) database
    field and can be parsed for the created timestamp, worker ID, and
    datacenter ID.
    See https://github.com/twitter-archive/snowflake/tree/snowflake-2010
    """
    # Custom epoch in milliseconds: Mon Jul  8 05:07:56 EDT 2019
    twepoch = 1562576876131

    worker_id_bits = 5
    data_center_id_bits = 5
    sequence_bits = 12
    # the remaining 42 bits hold the timestamp, for 64 bits in total

    # Largest value each field can hold (all of its bits set)
    max_worker_id = -1 ^ (-1 << worker_id_bits)
    max_data_center_id = -1 ^ (-1 << data_center_id_bits)
    max_ids_generated = -1 ^ (-1 << sequence_bits)

    # How far each field is shifted left inside the 64-bit ID
    worker_id_shift = sequence_bits
    data_center_id_shift = sequence_bits + worker_id_bits
    timestamp_left_shift = sequence_bits + worker_id_bits + data_center_id_bits

    # Sanity checks for input
    if worker_id > max_worker_id or worker_id < 0:
        raise ValueError("worker_id", "worker id can't be greater than %i or less than 0" % max_worker_id)
    if data_center_id > max_data_center_id or data_center_id < 0:
        raise ValueError("data_center_id", "data center id can't be greater than %i or less than 0" % max_data_center_id)
    if ids_generated > max_ids_generated or ids_generated < 0:
        raise ValueError("ids_generated", "ids generated can't be greater than %i or less than 0" % max_ids_generated)

    timestamp = int(time.time() * 1000)

    # ids_generated is the per-millisecond sequence and fills the low 12 bits
    new_id = ((timestamp - twepoch) << timestamp_left_shift) | (data_center_id << data_center_id_shift) | (worker_id << worker_id_shift) | ids_generated

    return new_id
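Going the other way, the fields can be pulled back out of an ID with shifts and masks. This is a sketch that assumes the same epoch and field widths as the function above; parse_snowflake_id is a name I made up, not part of Twitter's code.

```python
# Epoch and field widths copied from the generator function above.
TWEPOCH = 1562576876131
SEQUENCE_BITS = 12
WORKER_ID_BITS = 5
DATA_CENTER_ID_BITS = 5


def parse_snowflake_id(snowflake_id):
    """Split an ID into (timestamp_ms, data_center_id, worker_id, sequence)."""
    sequence = snowflake_id & ((1 << SEQUENCE_BITS) - 1)
    worker_id = (snowflake_id >> SEQUENCE_BITS) & ((1 << WORKER_ID_BITS) - 1)
    data_center_id = (snowflake_id >> (SEQUENCE_BITS + WORKER_ID_BITS)) & ((1 << DATA_CENTER_ID_BITS) - 1)
    # Whatever is left above the low 22 bits is the time since the custom epoch.
    timestamp_ms = (snowflake_id >> (SEQUENCE_BITS + WORKER_ID_BITS + DATA_CENTER_ID_BITS)) + TWEPOCH
    return timestamp_ms, data_center_id, worker_id, sequence


# Build an ID by hand: 1000 ms after the epoch, data center 3, worker 7, sequence 42.
example_id = (1000 << 22) | (3 << 17) | (7 << 12) | 42
print(parse_snowflake_id(example_id))  # (1562576877131, 3, 7, 42)
```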

Hope this answer satisfies ya :)


answered Nov 15 '22 05:11

J T


Tags:

database

facebook

sharding

id-generation

instagram
