I've decided to use GUIDs as primary keys for many of my project's DB tables. I think it is good practice, especially with scalability, backup and restore in mind. The problem is that I don't want to use the regular GUID format, so I am looking for an alternative approach. I was interested to know what Pinterest is using as a primary key. When you look at the URL you see something like this:
http://pinterest.com/pin/275001120966638272/
I prefer the numerical representation, even if it is stored as a string. Is there any way to achieve this?
Furthermore, YouTube also uses a different kind of hashing technique which I can't figure out:
http://www.youtube.com/watch?v=kOXFLI6fd5A
This reminds me of a URL-shortener-like scheme.
I prefer the shortest one, but I know that it isn't guaranteed to be unique. I first thought about doing something like this:
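// Milliseconds elapsed since 1970-01-01, measured in UTC to avoid time-zone and DST jumps.
DateTime dt1970 = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
DateTime current = DateTime.UtcNow;
TimeSpan span = current - dt1970;
Console.WriteLine(span.TotalMilliseconds);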
Result Example:
1350433430523.66
This prints the total number of milliseconds since 1970. But what happens if I have hundreds of thousands of writes per second?
I mainly prefer a solution other than a BIGINT auto-increment because it makes scaling the DB with third-party tools much less of a headache, and backup/restore is less problematic because I can transfer data between servers if I want.
Another, more sophisticated approach is to tailor the solution to my application. In the database, the primary key will also contain the username (unique and can't be changed by the user), so I can combine the numerical value of the name with the millisecond number, which will give me a unique numerical string. Because a user doesn't insert data at such a high rate, the numerical ID is guaranteed to be unique. I could also remove the last 5 digits and still get a unique ID, because I assume that a user won't insert data more than once per second at most, but I probably won't do that (what do you think about this idea?).
So I ask for your help. My data is expected to grow very large: 2 TB a year, with tens of thousands of new rows each second. I want URLs to look as "friendly" as possible, and prefer not to use the 'regular' GUID.
I am developing my app using ASP.NET 4.5 and MySQL.
Thanks.
GUIDs can be considered global primary keys. Local primary keys are used to uniquely identify records within a table. GUIDs, on the other hand, can be used to uniquely identify records across tables, databases, and servers.
A GUID is a unique number that can be used as an identifier for anything in the universe, but unlike an ISBN there is no central authority: the uniqueness of a GUID relies on the algorithm that was used to generate it.
Users do not need to rely on a centralized authority to administer GUIDs, as anyone can use a generation algorithm to create a GUID. Individuals and organizations can create GUIDs using a free GUID generator that is available online. An online generator constructs a unique GUID according to RFC 4122.
GUID (Globally Unique IDentifier): an implementation of the universally unique ID (see UUID) that is computed by Windows and Windows applications. Using a pseudo-random 128-bit number, GUIDs are used to identify user accounts, documents, software, hardware, software interfaces, sessions, database keys and other items.
For YouTube-like IDs you can see this answer. They basically keep a database table of all the random video IDs they generate. When they need a new one, they check the table for collisions. If they find a collision, they generate a new one.
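The generate-and-retry idea described above could be sketched roughly as follows. This is only an illustration: the class name `RandomIdGenerator`, the 64-symbol alphabet and the `ISet<string>` that stands in for a database table with a UNIQUE constraint are all assumptions, not anything YouTube has published.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: random URL-safe IDs with a retry on collision.
public static class RandomIdGenerator
{
    // 64 URL-safe symbols, so each character carries 6 bits.
    private const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    // Builds one random ID of the given length from cryptographic random bytes.
    public static string NewId(int length)
    {
        byte[] bytes = new byte[length];
        using (RandomNumberGenerator rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }

        StringBuilder sb = new StringBuilder(length);
        foreach (byte b in bytes)
        {
            // 256 is a multiple of 64, so masking introduces no bias.
            sb.Append(Alphabet[b & 63]);
        }
        return sb.ToString();
    }

    // Keeps generating until the ID is not already present in 'existing'.
    // In a real system 'existing' would be an INSERT against a unique column,
    // retried when the constraint is violated.
    public static string NewUniqueId(ISet<string> existing, int length)
    {
        while (true)
        {
            string id = NewId(length);
            if (existing.Add(id))
            {
                return id;
            }
        }
    }
}
```

Calling `RandomIdGenerator.NewUniqueId(seen, 11)` with a `HashSet<string>` returns an 11-character ID, the same length as the video ID in the question. With 66 bits of randomness a retry should be extremely rare, but the loop keeps the result correct when one does occur.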
You could use a long (e.g. 275001120966638272) as a primary key. However, if you have multiple servers generating unique identifiers, you'll have to partition them somehow or introduce a global lock so that two servers don't generate the same identifier.
One solution to the partitioning problem with long IDs is to use snowflake IDs. This is what Twitter uses to generate its IDs. Each generated ID is made up of the following parts: a 41-bit timestamp in milliseconds since a custom epoch, a 10-bit machine ID, and a 12-bit sequence number that is local to each machine.
One extra bit is reserved for future purposes. Since the IDs use the timestamp as the first component, they are time sortable (which is very important for query performance).
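A minimal sketch of a snowflake-style generator, assuming the commonly documented layout of a 41-bit millisecond timestamp, a 10-bit machine ID and a 12-bit sequence number. The class name `SnowflakeIdGenerator` and the 2012-01-01 epoch are hypothetical choices for illustration; this is not Twitter's actual implementation.

```csharp
using System;

// Hypothetical snowflake-style generator: timestamp | machine ID | sequence.
public sealed class SnowflakeIdGenerator
{
    private const int MachineIdBits = 10;
    private const int SequenceBits = 12;
    private const long MaxMachineId = (1L << MachineIdBits) - 1; // 1023
    private const long SequenceMask = (1L << SequenceBits) - 1;  // 4095

    // Assumed custom epoch; any fixed instant in the past works.
    private static readonly DateTime Epoch =
        new DateTime(2012, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    private readonly object _lock = new object();
    private readonly long _machineId;
    private long _lastTimestamp = -1;
    private long _sequence;

    public SnowflakeIdGenerator(long machineId)
    {
        if (machineId < 0 || machineId > MaxMachineId)
        {
            throw new ArgumentOutOfRangeException("machineId");
        }
        _machineId = machineId;
    }

    public long NextId()
    {
        lock (_lock)
        {
            long timestamp = CurrentMillis();
            if (timestamp < _lastTimestamp)
            {
                throw new InvalidOperationException("Clock moved backwards.");
            }

            if (timestamp == _lastTimestamp)
            {
                _sequence = (_sequence + 1) & SequenceMask;
                if (_sequence == 0)
                {
                    // All 4096 sequence values used in this millisecond:
                    // spin until the clock reaches the next one.
                    while (timestamp <= _lastTimestamp)
                    {
                        timestamp = CurrentMillis();
                    }
                }
            }
            else
            {
                _sequence = 0;
            }

            _lastTimestamp = timestamp;
            return (timestamp << (MachineIdBits + SequenceBits))
                 | (_machineId << SequenceBits)
                 | _sequence;
        }
    }

    private static long CurrentMillis()
    {
        return (long)(DateTime.UtcNow - Epoch).TotalMilliseconds;
    }
}
```

Each server would be configured with its own machine ID, for example `new SnowflakeIdGenerator(1)`, so no coordination between servers is needed at insert time. The result fits in a MySQL BIGINT, and IDs from a single generator are strictly increasing.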
You can use ShortGuid, which encodes a GUID as a base64 string. The downside is that the output is a little ugly (e.g. 00amyWGct0y_ze4lIsj2Mw) and it's case sensitive, which may not be good for URLs if you are lower-casing them.
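The encoding itself can be approximated with the framework's own base64 routines. The sketch below is not the ShortGuid library's code; `ShortGuidSketch` is a hypothetical name, and it assumes the usual URL-safe substitutions (`-` for `+`, `_` for `/`) with the padding removed.

```csharp
using System;

// Hypothetical helper showing URL-safe base64 encoding of a GUID.
public static class ShortGuidSketch
{
    public static string Encode(Guid guid)
    {
        // 16 bytes become 24 base64 characters, the last two always "==".
        return Convert.ToBase64String(guid.ToByteArray())
            .Substring(0, 22)
            .Replace('+', '-')
            .Replace('/', '_');
    }

    public static Guid Decode(string shortGuid)
    {
        // Undo the URL-safe substitutions and restore the padding.
        string base64 = shortGuid.Replace('-', '+').Replace('_', '/') + "==";
        return new Guid(Convert.FromBase64String(base64));
    }
}
```

`ShortGuidSketch.Decode(ShortGuidSketch.Encode(g))` returns the original GUID, so the 22-character string can be used in URLs while the database column still stores the full value.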
There is also base32 encoding of GUIDs, for which you can see this answer. These are slightly longer than ShortGuid above (e.g. lt7fz44kdqlu5pt7wnyzmu4ov4) but the advantage is that they can be all lower case.
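A base32 encoder is short enough to write by hand. The following is a sketch that assumes the lower-cased RFC 4648 alphabet (a-z, 2-7) without padding; the class name `Base32Guid` is hypothetical and this is not the code from the linked answer.

```csharp
using System;
using System.Text;

// Hypothetical helper: lower-case, unpadded base32 encoding of a GUID.
public static class Base32Guid
{
    private const string Alphabet = "abcdefghijklmnopqrstuvwxyz234567";

    public static string Encode(Guid guid)
    {
        byte[] bytes = guid.ToByteArray();
        StringBuilder sb = new StringBuilder(26);
        int buffer = 0;        // bits read but not yet emitted
        int bitsInBuffer = 0;

        foreach (byte b in bytes)
        {
            buffer = (buffer << 8) | b;
            bitsInBuffer += 8;

            // Emit one character for every complete group of 5 bits.
            while (bitsInBuffer >= 5)
            {
                bitsInBuffer -= 5;
                sb.Append(Alphabet[(buffer >> bitsInBuffer) & 31]);
            }

            // Keep only the leftover bits.
            buffer &= (1 << bitsInBuffer) - 1;
        }

        if (bitsInBuffer > 0)
        {
            // Pad the final partial group with zero bits on the right.
            sb.Append(Alphabet[(buffer << (5 - bitsInBuffer)) & 31]);
        }
        return sb.ToString();
    }
}
```

A 128-bit GUID always encodes to 26 characters this way (25 full groups of 5 bits plus one partial group), which matches the length of the example above.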
One alternative I have been thinking about is to introduce multiple factors, e.g. if Pinterest used a username and an ID for extra uniqueness:
https://pinterest.com/some-user/1
Here the ID 1 is unique to the user some-user and could be the number of posts they've made, i.e. their next post would be 2. You could also use YouTube's approach with their video ID but specific to a user; this could lead to some ridiculously short URLs.
The first, simplest and most practical scheme for unique keys is an increasing sequence number that follows the write order. It represents the record number inside one database and provides unique numbering on a local scale, which is often all the application requires.
Next, a numerical approach based on concatenating a timestamp and counters is commonly used to ensure that concurrent transactions in the same batch get unique IDs before they are written.
When the system becomes heavily threaded and distributed, as in highly concurrent situations, some constraints need to be relaxed before they become a penalty for scaling.
Yes, it's a good practice.
The article Generating Globally Unique Identifiers for Use with MongoDB by Alexander Marquardt (a Senior Consulting Engineer at MongoDB) covers the question in detail and gives some insight into databases and informatics.
UUIDs are 128 bits long. They carry enough entropy to make labels unique in practice. They can be represented as a 32-character hexadecimal string, which is enough to write about 3.4 × 10^38 distinct values.
Here are a few more questions that can occur when considering the overall principle and the analysis:
... (h), followed by a user number (u) and time (t) along with a write index (i), guarantee the PK huti to stay unique?
Now considering the DB system:
A hashing technique that produces YouTube-style IDs is hashids.
It's a good choice: the hashes are short and their length can be controlled, the alphabet can be customized, it is reversible (and as such useful as a short reference to the primary keys), and it can use a salt. It is designed to encode positive numbers.
However, as soon as an ID comes from a true hash or is truncated, there is a probability that a collision happens (hashids itself is reversible, so distinct numbers encoded with the same salt do not collide). Collisions can be detected: the unique constraint is violated before the row is stored, and in that case the generation should be run again.
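To illustrate why a reversible encoding is attractive as a short reference to a numeric primary key, here is a plain base62 round trip. It is deliberately not the hashids algorithm (there is no salt and no shuffled alphabet); `Base62` is a hypothetical helper that only shows the general idea.

```csharp
using System;
using System.Text;

// Hypothetical helper: reversible base62 encoding of a non-negative long.
public static class Base62
{
    private const string Alphabet =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    public static string Encode(long value)
    {
        if (value < 0)
        {
            throw new ArgumentOutOfRangeException("value");
        }
        if (value == 0)
        {
            return "0";
        }

        StringBuilder sb = new StringBuilder();
        while (value > 0)
        {
            // Prepend the least significant base62 digit.
            sb.Insert(0, Alphabet[(int)(value % 62)]);
            value /= 62;
        }
        return sb.ToString();
    }

    public static long Decode(string text)
    {
        long value = 0;
        foreach (char c in text)
        {
            int digit = Alphabet.IndexOf(c);
            if (digit < 0)
            {
                throw new FormatException("Invalid character: " + c);
            }
            // 'checked' turns an overflow into an exception instead of wrapping.
            value = checked(value * 62 + digit);
        }
        return value;
    }
}
```

Any non-negative long, such as the Pinterest-style number 275001120966638272, fits in at most 11 base62 characters, and `Base62.Decode` recovers the original key, so no lookup table and no collision check are needed.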
Consider the comment on this answer to figure out how much entropy can be obtained from a shortened sha1+b64 recipe. Anticipating the collision scenario calls for an estimate of the future size of the database, that is, the potential number of records. Recommended reading: Z. Bloom, How Long Does An ID Need To Be?
The following is cited from that article, which provides most of the answer to the problem at hand in a nicely concise style:
It may not be necessary for you to encode every time since 1970 however. If you are only interested in keeping recent records close to each other, you only need enough values to ensure that you don’t have more values with the same prefix than your database can cache at once
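To put rough numbers on the question of how long an ID needs to be, the standard birthday approximation can be evaluated directly. This is a sketch under the assumption that IDs are uniformly random over 2^bits values; `CollisionEstimate` is a hypothetical name and the formula is the textbook approximation, not something taken from the cited article.

```csharp
using System;

// Hypothetical helper: birthday-bound estimate of the collision probability.
public static class CollisionEstimate
{
    // Approximates the chance of at least one collision among 'records'
    // uniformly random IDs of 'bits' bits:
    //   p ~= 1 - exp(-n(n-1) / (2 * 2^bits))
    public static double Probability(double records, int bits)
    {
        double space = Math.Pow(2, bits);
        double x = records * (records - 1) / (2.0 * space);

        // For tiny x, 1 - exp(-x) loses all precision; x itself is accurate.
        if (x < 1e-8)
        {
            return x;
        }
        return 1.0 - Math.Exp(-x);
    }
}
```

For instance, `CollisionEstimate.Probability(5e9, 64)` comes out close to 50%, while the same number of records in a 128-bit space gives a probability that is negligible for practical purposes. Shortened IDs therefore need the retry-on-conflict handling described earlier.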
What you could do is convert a GUID into a numeric-only string by converting all the letters in the GUID into numbers. Here is an example of what that would look like. It's a bit long, but if that is not a problem this could be one way of generating the keys.
1004234499987310234371029731000544986101469898102
Here is the code I used to generate the string above. I would probably recommend using a long primary key instead, though; it can be a bit of a pain, but it's probably a safer way to do it than the function below.
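// Requires: using System; using System.Text;
string generateKey()
{
    // "N" formats the GUID as 32 hex digits without dashes.
    string hex = Guid.NewGuid().ToString("N");
    StringBuilder newKey = new StringBuilder();
    foreach (char c in hex)
    {
        if (char.IsLetter(c))
        {
            // Replace each hex letter (a-f) with its character code (97-102).
            newKey.Append((int)c);
        }
        else
        {
            newKey.Append(c);
        }
    }
    return newKey.ToString();
}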
Edit:
I did some testing taking only the first 20 digits: out of 5000000 generated keys, 4999978 were unique. When using the first 25 digits, all 5000000 out of 5000000 were unique. I would recommend doing some more testing if you go with this method.