
Proposed solution: Generate unique IDs in a distributed environment

I've been browsing the net trying to find a solution that will allow us to generate unique IDs in a regionally distributed environment.

I looked at the following options (among others):

SNOWFLAKE (by Twitter)

  • It seems like a great solution, but I don't like the added complexity of having to manage another piece of software just to create IDs;
  • It lacks documentation at this stage, so I don't think it will be a good investment;
  • The nodes need to be able to communicate with one another using Zookeeper (what about latency / communication failure?)

UUID

  • Just look at it: 550e8400-e29b-41d4-a716-446655440000;
  • It's a 128-bit ID;
  • There have been some known collisions (depending on the version, I guess), see this post.

AUTOINCREMENT IN RELATIONAL DATABASE LIKE MYSQL

  • This seems safe, but unfortunately, we are not using relational databases (scalability preferences);
  • We could deploy a MySQL server for this like what Flickr does, but again, this introduces another point of failure / bottleneck. Also added complexity.

AUTOINCREMENT IN A NON-RELATIONAL DATABASE LIKE COUCHBASE

  • This could work since we are using Couchbase as our database server, but;
  • This will not work when we have more than one cluster in different regions (latency issues, network failures): at some point, IDs will collide, depending on the amount of traffic.

MY PROPOSED SOLUTION (this is what I need help with)

Let's say that we have clusters consisting of 10 Couchbase nodes and 10 application nodes in each of 5 different regions (Africa, Europe, Asia, America and Oceania). This is to ensure that content is served from a location closest to the user (to boost speed) and to ensure redundancy in case of disasters etc.

Now, the task is to generate IDs that won't collide when the replication (and balancing) occurs, and I think this can be achieved in 3 steps:

Step 1

All regions will be assigned integer IDs (unique identifiers):

  • 1 - Africa;
  • 2 - America;
  • 3 - Asia;
  • 4 - Europe;
  • 5 - Oceania.

Step 2

Assign an ID to every application node that is added to the cluster, keeping in mind that there may be up to 99 999 servers in one cluster (I doubt it will ever get there; this is just a safety precaution). This will look something like this (fake IPs):

  • 00001 - 192.187.22.14
  • 00002 - 164.254.58.22
  • 00003 - 142.77.22.45
  • and so forth.

Please note that all of these are in the same cluster, which means every region can have its own node 00001.

Step 3

For every record inserted into the database, an incremented ID will be used to identify it, and this is how it will work:

Couchbase offers an increment feature that we can use to create IDs internally within the cluster. To ensure redundancy, 3 replicas will be created within the cluster. Since these are in the same place, I think it should be safe to assume that unless the whole cluster is down, one of the nodes responsible for this will be available; otherwise, the number of replicas can be increased.

Bringing it all together

Say a user is signing up from Europe: The application node serving the request will grab the region code (4 in this case), get its own ID (say 00005) and then get an incremented ID (1) from Couchbase (from the same cluster).

We end up with 3 components: 4, 00005 and 1. Now, to create an ID from this, we can just join these components into 4.00005.1. To make it even better (I'm not too sure about this), we can concatenate the components (not add them up) to end up with: 4000051.

In code, this will look something like this:

$id = '4'.'00005'.'1';

NB: Not $id = 4+00005+1;.
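A minimal sketch of the concatenation above, assuming the region code is a single digit and the node ID is always zero-padded to five digits. The function name `composeId` and its range checks are illustrative assumptions, not part of the original proposal:

```php
<?php
// Hypothetical helper: the name, parameters and range limits are illustrative assumptions.
function composeId(int $region, int $node, int $counter): string
{
    if ($region < 1 || $region > 9) {
        throw new InvalidArgumentException('region must be a single digit (1-9)');
    }
    if ($node < 1 || $node > 99999) {
        throw new InvalidArgumentException('node must fit in five digits');
    }
    if ($counter < 1) {
        throw new InvalidArgumentException('counter must be positive');
    }
    // %05d left-pads the node ID with zeros so the prefix is always six digits wide.
    return sprintf('%d%05d%d', $region, $node, $counter);
}

echo composeId(4, 5, 1), PHP_EOL; // 4000051
```

Because the region and node parts have a fixed width, the trailing counter can grow without making the ID ambiguous. By contrast, 4+00005+1 evaluates to 10 in PHP, since 00005 is read as an octal literal equal to 5.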

Pros

  • IDs look better than UUIDs;
  • They seem unique enough. Even if a node in another region generated the same incremented ID and has the same node ID as the one above, we always have the region code to set them apart;
  • They can still be stored as integers (probably Big Unsigned integers);
  • It's all part of the architecture, no added complexities.
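On storing the IDs as integers: PHP has no unsigned integer type, so within the application the limit is the signed PHP_INT_MAX. A rough sketch of how many digits are left for the counter, assuming the 1 + 5 digit prefix from the steps above (the variable names are illustrative):

```php
<?php
// Illustrative assumption: 1 region digit + 5 node digits, the rest is the counter.
$prefixDigits = 1 + 5;
$maxDigits = strlen((string) PHP_INT_MAX); // 19 on 64-bit builds
// Use one digit fewer than the maximum so that every value of that width fits.
$safeCounterDigits = $maxDigits - 1 - $prefixDigits;

echo "counter may use up to {$safeCounterDigits} digits", PHP_EOL;
```

Appending further parts to the ID (such as a random suffix) eats into this budget, so it is worth redoing the arithmetic whenever the layout changes.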

Cons

  • No sorting (or is there)?
  • This is where I need your input (most)

I know that every solution has flaws, and possibly more than what we see on the surface. Can you spot any issues with this whole approach?

Thank you in advance for your help :-)

EDIT

As @DaveRandom suggested, we can add the 4th step:

Step 4

We can just generate a random number and append it to the ID to prevent predictability. Effectively, you end up with something like this:

4000051357 instead of just 4000051.
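One thing to watch with this step: if both the counter and the random number have variable length, 4000051357 could mean counter 1 with suffix 357 or counter 13 with suffix 57, so two different records could end up with the same ID. A minimal sketch that avoids this by giving the suffix a fixed width; the helper name and the three-digit width are assumptions for illustration:

```php
<?php
// Hypothetical helper; the three-digit suffix width is an assumption for illustration.
function composeIdWithSuffix(int $region, int $node, int $counter, int $suffixDigits = 3): string
{
    $suffix = random_int(0, 10 ** $suffixDigits - 1);
    // Pad the suffix to a fixed width so it can always be told apart from the counter.
    return sprintf('%d%05d%d%0' . $suffixDigits . 'd', $region, $node, $counter, $suffix);
}

$id = composeIdWithSuffix(4, 5, 1);
// Dropping the fixed-width suffix recovers the unique part.
echo substr($id, 0, -3), PHP_EOL; // 4000051
```

The uniqueness still comes entirely from the region, node and counter parts; the suffix only makes the next ID harder to guess.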

Sthe asked Aug 15 '13 08:08



1 Answer

I think this looks pretty solid. Each region maintains consistency, and if you use XDCR there are no collisions. INCR is atomic within a cluster, so you will have no issues there. You don't actually need the machine ID part. If all the app servers within a region are connected to the same cluster, infixing the 00001 part is irrelevant. If it is useful to you for other reasons (some sort of analytics), then by all means keep it, but it isn't necessary.

So it can simply be '4' . '1' (using your example).

Can you give me an example of what kind of "sorting" you need?

First: One downside of adding entropy (and I am not sure why you would need it) is that you cannot iterate over the ID collection as easily.

For example: If you have IDs from 1-100, which you will know from a simple GET query on the counter key, you can assign tasks by group: this task takes 1-10, the next takes 11-20 and so on, and the workers can execute in parallel. If you add entropy, you will need to use a Map/Reduce View to pull the collections down, so you lose the benefit of a key-value pattern.
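A rough sketch of that batching idea, assuming the current counter value (100 here) has already been read from the counter key. Both function names are hypothetical, and the keys use the region-plus-counter form without a node ID:

```php
<?php
// Hypothetical helper: splits the counter range 1..$max into batches of $size.
function counterBatches(int $max, int $size): array
{
    $batches = [];
    for ($start = 1; $start <= $max; $start += $size) {
        $batches[] = [$start, min($start + $size - 1, $max)];
    }
    return $batches;
}

// Hypothetical helper: builds the document keys for one batch.
function keysForBatch(string $region, array $batch): array
{
    [$from, $to] = $batch;
    return array_map(fn (int $n): string => $region . $n, range($from, $to));
}

$batches = counterBatches(100, 10);
print_r($batches[1]);                    // the second batch: 11 to 20
print_r(keysForBatch('4', $batches[0])); // keys 41 through 410
```

Each worker can then fetch its keys directly, with no view needed. This only works while the keys are fully predictable from the counter, which is exactly what a random suffix takes away.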

Second: Since you are concerned with readability, it can be valuable to add a document/object type identifier as well, and this can be used in Map/Reduce Views (or you can use a JSON key to identify the type).

Ex: 'u:' . '4' . '1'
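A small sketch of building and reading back such a typed key. The helper names are hypothetical, and the parser assumes a lowercase type prefix and a single-digit region code:

```php
<?php
// Hypothetical helpers; the "type:regioncounter" layout follows the example above.
function makeKey(string $type, int $region, int $counter): string
{
    return $type . ':' . $region . $counter;
}

function parseKey(string $key): ?array
{
    // Assumes a lowercase type prefix and a single-digit region code.
    if (preg_match('/^([a-z]+):(\d)(\d+)$/', $key, $m) !== 1) {
        return null;
    }
    return ['type' => $m[1], 'region' => (int) $m[2], 'counter' => (int) $m[3]];
}

echo makeKey('u', 4, 1), PHP_EOL; // u:41
```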

If you are referring to IDs externally, you might want to obscure them in other ways. If you need an example, let me know and I can extend my answer with something you could do.

@scalabl3

scalabl3 answered Nov 11 '22 11:11
