Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sequential GUID generation in Java with SQL Server uniqueidentifier

The problem that I'm trying to solve is:

1 - In our DB we have all the tables (also tables with millions of records) with a PK id column declared as VARCHAR(36). There is also a clustered index on it This of course as I read online it's an awful thing for performance, also because the db has a lot of reads, inserts, updates and deletes.

2 - We use Hibernate for our java web application as an ORM to this db

After extensive reading online I started changing the data type of those columns to UNIQUEIDENTIFIER with the default option newsequentialid() as this option should mitigate the fragmentation issues of our indexes.

What I noted is that the fragmentation issue persisted, the tables became heavily fragmented short after rebuilds (we do a full index rebuild every night).

Then I saw that all of our Hibernate mappings for the id columns consisted of this:

<id name="id" column="id" type="string">
        <generator class="guid"/>
</id>

When an insert occurred in our system the log showed that the insert was done after calling select newid(), so since this returns a random guid the insert would be placed in a random point in the index, thus causing fragmentation (this totally defeat the column data type change I did also).

So after another online search I tried to implement a guid generator in Hibernate by myself, implementing the interface IdentifierGenerator and using a time based generator with JUG (http://wiki.fasterxml.com/JugHome).

The code that generates the (I thought sequential) id is this:

String uuid = null;   
EthernetAddress nic = EthernetAddress.fromInterface();    
TimeBasedGenerator uuidGenerator = Generators.timeBasedGenerator(nic);                
uuid = uuidGenerator.generate().toString();

And I changed accordingly the mapping to this:

<id name="id" column="id" type="string">
            <generator class="my_package.hibernate.CustomSequentialGuidGenerator">
            </generator>
</id>

Then I tried to generate some test uuids to test their sequentiality (sequential in the uniqueidentifier way, so binary), this is a short list (every element is generated before the successive) :

314a9a1b-6295-11e5-8d2c-2c27d7e1614f
3d867801-6295-11e5-ae09-2c27d7e1614f
4434ac7d-6295-11e5-9ed1-2c27d7e1614f
491462c4-6295-11e5-af81-2c27d7e1614f
5389ff4c-6295-11e5-84cf-2c27d7e1614f
57098959-6295-11e5-b203-2c27d7e1614f
5b62d144-6295-11e5-9883-2c27d7e1614f

This look to me as alphabetically sequential, but not binary sequential.

The above test was done executing seven times the test application, it wasn't a loop.

I tried to insert those values in a column declared as unique identifier and after issuing a select on this column this is the list sql server outputs:

5389FF4C-6295-11E5-84CF-2C27D7E1614F
314A9A1B-6295-11E5-8D2C-2C27D7E1614F
5B62D144-6295-11E5-9883-2C27D7E1614F
4434AC7D-6295-11E5-9ED1-2C27D7E1614F
3D867801-6295-11E5-AE09-2C27D7E1614F
491462C4-6295-11E5-AF81-2C27D7E1614F
57098959-6295-11E5-B203-2C27D7E1614F

So I really don't understand what I should do and if I can use JUG as a sequential guid generator to avoid my fragmentation issues.

This is another JUG test, I tried 3 runs each time generating 10 guids with a loop:

Run 1

54bd156e-62a2-11e5-a1a7-2c27d7e1614f
54c3cc2f-62a2-11e5-a1a7-2c27d7e1614f
54caf820-62a2-11e5-a1a7-2c27d7e1614f
54d1aee1-62a2-11e5-a1a7-2c27d7e1614f
54d901e2-62a2-11e5-a1a7-2c27d7e1614f
54df9193-62a2-11e5-a1a7-2c27d7e1614f
54e64854-62a2-11e5-a1a7-2c27d7e1614f
54ecff15-62a2-11e5-a1a7-2c27d7e1614f
54f3b5d6-62a2-11e5-a1a7-2c27d7e1614f
54fa4587-62a2-11e5-a1a7-2c27d7e1614f

Run 2

87c66bcc-62a2-11e5-8e7c-2c27d7e1614f
87ccd46d-62a2-11e5-8e7c-2c27d7e1614f
87d3641e-62a2-11e5-8e7c-2c27d7e1614f
87d97e9f-62a2-11e5-8e7c-2c27d7e1614f
87e05c70-62a2-11e5-8e7c-2c27d7e1614f
87e6ec21-62a2-11e5-8e7c-2c27d7e1614f
87ed7bd2-62a2-11e5-8e7c-2c27d7e1614f
87f40b83-62a2-11e5-8e7c-2c27d7e1614f
87fac244-62a2-11e5-8e7c-2c27d7e1614f
880103d5-62a2-11e5-8e7c-2c27d7e1614f

Run 3

a4b690db-62a2-11e5-b667-2c27d7e1614f
a4bcd26c-62a2-11e5-b667-2c27d7e1614f
a4c2eced-62a2-11e5-b667-2c27d7e1614f
a4c92e7e-62a2-11e5-b667-2c27d7e1614f
a4cf48ff-62a2-11e5-b667-2c27d7e1614f
a4d5d8b0-62a2-11e5-b667-2c27d7e1614f
a4dc6861-62a2-11e5-b667-2c27d7e1614f
a4e34632-62a2-11e5-b667-2c27d7e1614f
a4e9d5e3-62a2-11e5-b667-2c27d7e1614f
a4f101d4-62a2-11e5-b667-2c27d7e1614f

Run 4

c2b872b2-62a2-11e5-b855-2c27d7e1614f
c2c17363-62a2-11e5-b855-2c27d7e1614f
c2c82a24-62a2-11e5-b855-2c27d7e1614f
c2ce92c5-62a2-11e5-b855-2c27d7e1614f
c2d57096-62a2-11e5-b855-2c27d7e1614f
c2dc2757-62a2-11e5-b855-2c27d7e1614f
c2e32c38-62a2-11e5-b855-2c27d7e1614f
c2e9bbe9-62a2-11e5-b855-2c27d7e1614f
c2f099ba-62a2-11e5-b855-2c27d7e1614f
c2f7507b-62a2-11e5-b855-2c27d7e1614f

Run 5

f0263d1b-62a2-11e5-8529-2c27d7e1614f
f02d1aec-62a2-11e5-8529-2c27d7e1614f
f033d1ad-62a2-11e5-8529-2c27d7e1614f
f03a615e-62a2-11e5-8529-2c27d7e1614f
f041181f-62a2-11e5-8529-2c27d7e1614f
f047a7d0-62a2-11e5-8529-2c27d7e1614f
f04dc251-62a2-11e5-8529-2c27d7e1614f
f05403e2-62a2-11e5-8529-2c27d7e1614f
f05a6c83-62a2-11e5-8529-2c27d7e1614f
f0608704-62a2-11e5-8529-2c27d7e1614f

Run 6 (Started from 0 again)

00fd4ec3-62a3-11e5-8ab8-2c27d7e1614f
01042c94-62a3-11e5-8ab8-2c27d7e1614f
010b3175-62a3-11e5-8ab8-2c27d7e1614f
0111e836-62a3-11e5-8ab8-2c27d7e1614f
0118ed17-62a3-11e5-8ab8-2c27d7e1614f
011fcae8-62a3-11e5-8ab8-2c27d7e1614f
0126a8b9-62a3-11e5-8ab8-2c27d7e1614f
012d115a-62a3-11e5-8ab8-2c27d7e1614f
0133c81b-62a3-11e5-8ab8-2c27d7e1614f
013a30bc-62a3-11e5-8ab8-2c27d7e1614f

The single groups are alphabetically (but not binary) ordered and taken the different runs as a whole they aren't event alphabetically ordered (sigh).

What am I missing?

************************* EDIT - Description of my implementation ******************

After the various comments and answers I implemented the following strategy:

I generated my own sequential (based on current timestamp) guids and this is the generator class:

package it.hibernate;

import java.io.Serializable;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.commons.lang.RandomStringUtils;
import org.hibernate.HibernateException;
import org.hibernate.engine.SessionImplementor;
import org.hibernate.id.IdentifierGenerator;



public class CustomSequentialGuidGenerator implements IdentifierGenerator{


    @Override
    public Serializable generate(SessionImplementor session, Object object)
            throws HibernateException 
        {

        String uuid = null;
        try {

            Date data = new Date();

             SimpleDateFormat sdf = new SimpleDateFormat(); 
             String rand = RandomStringUtils.randomAlphanumeric(12);

             sdf.applyPattern("yyyy");
             String year = sdf.format(data);

             sdf.applyPattern("MM");
             String month = sdf.format(data);

             sdf.applyPattern("dd");
             String day = sdf.format(data);

             sdf.applyPattern("HH");
             String hour = sdf.format(data);

             sdf.applyPattern("mm");
             String mins = sdf.format(data);

             sdf.applyPattern("ss");
             String secs = sdf.format(data);

             sdf.applyPattern("SSS");
             String millis = sdf.format(data);

             //G carachter is used to insert the rows after
             uuid = "GG" + year + month + "-" + day + hour + "-" + mins + secs + "-" + "0" + millis + "-" + rand;


        } 
        catch (Exception exception) 
        {
            exception.printStackTrace();                
        }

        return uuid;
    }
}

You can note that all rows start with the string 'GG' because I had to make sure that all new rows would be inserted AFTER all the old rows generated via select newid(). After that there's the current timestamp and 12 random chars to avoid collision in case of a multiple row insert in the same millisecond.

After a test of 2000 insert the primary key index fragmentation dropped from 17,92% to 0,15%.

N.B. The data type I reintroduced is obviously a varchar(36) again and not a uniqueidentifier, so the rows are sorted alphabetically.

like image 428
frankieta Avatar asked Sep 24 '15 10:09

frankieta


1 Answers

The default option of newsequentialid() of course did not work, because hibernate does not use the default, it always sets a value issued by its generator.

By taking a quick look at the JUG library, it appears that it does not offer any means of generating GUIDs sequentially. I do not know why you thought that the generate() method of the generator obtained via Generators.timeBasedGenerator() would give you sequential GUIDs. A time based generator is simply a generator which takes the current time into account when generating GUIDs, but it is free to mangle the current time coordinate in any way it sees fit when embedding it into the GUID, so it does not guarantee that there will be anything sequential about the resulting GUIDs.

Generally, the terms "GUID" and "sequential" are incompatible with each other. You can either have keys that are GUIDs, or keys that are sequential, but under normal circumstances, you cannot have both.

So, are you sure that the keys must be GUIDs? Personally, I find GUIDs very hard to work with.

But if you must do any hacks necessary as to have sequential GUIDs, then my recommendation would be to write your own function which generates 36-character strings that look like GUIDs, but are sequential.

The sequential part should come from a SEQUENCE, which simply issues sequential integers. (I believe MS-SQL-Server supports them.)

You can read the IETF's UUID specification on how to construct GUIDs properly, but you do not have to follow it to the letter. For the most part, if it simply looks like a GUID, it is good enough.

If you can have a single global sequence for this, that's good. If you cannot have a single global sequence, then you need to somehow identify your sequences, and then take the identifier of each sequence into account when generating your GUIDs. (That would be the "node id" mentioned in the IETF documentation.)

I once had the unreasonable requirement that the rows that I was to transmit to a certain web service had to be identified by GUIDs, and there was too much red tape that was preventing me from contacting them to ask them "are you friggin' serious?" so I just transmitted GUIDs like the following:

|--- random part -----| |-- key ---|
314a9a1b-6295-11e5-8d2c-000000000001
314a9a1b-6295-11e5-8d2c-000000000002
314a9a1b-6295-11e5-8d2c-000000000003
314a9a1b-6295-11e5-8d2c-000000000004
314a9a1b-6295-11e5-8d2c-000000000005
...

They did not say a word.

like image 185
Mike Nakis Avatar answered Oct 17 '22 00:10

Mike Nakis