
SQLAlchemy, UUIDs, Sharding, and AUTO_INCREMENT primary key... how to get them to work together?

I have a question pertaining to SQLAlchemy, database sharding, and UUIDs for you fine folks.

I'm currently using MySQL in which I have a table of the form:

CREATE TABLE foo (
    added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    id BINARY(16) NOT NULL,
    ... other stuff ...
    UNIQUE KEY(id)
);

A little background on this table: I never care about 'added_id'; I'm only using it to ensure that inserted rows are clustered together on disk (since the B-tree used to store the table in MySQL uses the primary key as the clustered index). The 'id' column contains the binary representation of a UUID -- this is the column I actually care about, and everything else references this ID. Again, I don't want the UUID to be the primary key, since the UUID is random, which would give the B-tree index horrible IO characteristics (at least that is what has been said elsewhere). Also, although UUID1 includes a timestamp to ensure that IDs are generated in "sequential" order, the inclusion of the MAC address in the ID makes it something I'd rather avoid. Thus, I'd like to use UUID4s.
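For reference, converting a UUID4 to and from the 16-byte form stored in a BINARY(16) column needs nothing beyond the standard library (a minimal sketch, independent of any particular schema):

```python
import uuid

# Generate a random (version 4) UUID and take its 16-byte representation,
# which is what a BINARY(16) column stores.
new_id = uuid.uuid4()
id_bytes = new_id.bytes
assert len(id_bytes) == 16

# Reconstruct the UUID from the bytes read back out of the database.
round_tripped = uuid.UUID(bytes=id_bytes)
assert round_tripped == new_id
```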

Ok, now moving on to the SQLAlchemy part. In SQLAlchemy one can define a model using their ORM for the above table by doing something like:

from sqlalchemy import Column, Integer, Binary
from sqlalchemy.ext.declarative import declarative_base

# The SQLAlchemy ORM base class
Base = declarative_base()

# The model for table 'foo'
class Foo(Base):
    __tablename__ = 'foo'
    added_id = Column(Integer, primary_key=True, nullable=False)
    id = Column(Binary, index=True, unique=True, nullable=False)
    ...

Again, this is basically the same as the SQL above.

And now to the question. Let's say this database is going to be sharded (horizontally partitioned) into 2 (or more) separate databases. Assuming no deletions, each of these databases will have records with added_id of 1, 2, 3, etc. in table foo. Since SQLAlchemy uses a session to manage the objects being worked on, and each object is identified only by its primary key, it seems possible that I could end up trying to access two Foo objects from two different shards with the same added_id, resulting in a conflict in the managed session.

Has anyone run into this issue? What have you done to solve it? Or, more likely, am I missing something in the SQLAlchemy documentation that ensures this cannot happen? However, looking at the sharding example provided with the SQLAlchemy download (examples/sharding/attribute_shard.py), it seems to side-step this issue by designating one of the database shards as an ID generator, creating an implicit bottleneck since all INSERTs have to go to that single database to get an ID. (It also mentions using UUIDs, but apparently those cause the performance issue for the indexes.)

Alternatively, is there a way to set the UUID as the primary key while still having the data clustered on disk by added_id? If it's not possible in MySQL, is it possible in another DB like Postgres?

Thanks in advance for any and all input!

--- UPDATE --- I just want to add an out-of-band answer that I received to this question. The following text isn't something I wrote; I'm including it here in case someone finds it useful.

The easiest way to avoid that situation with MySQL and auto increment keys is to use different auto increment offsets for each database, e.g.:

ALTER TABLE foo AUTO_INCREMENT=100000;

The downside is that you need to take care in how you configure each shard, and you need to plan a bit with respect to the total number of shards you use.
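Besides fixed starting ranges, MySQL also offers the `auto_increment_increment` and `auto_increment_offset` server variables, which interleave the keys instead of partitioning them into blocks. The effect can be sketched in plain Python (a hypothetical simulation, not MySQL itself): each shard starts at its own offset and steps by the total shard count, so no two shards ever produce the same key.

```python
from itertools import islice

def shard_key_stream(shard_index, num_shards):
    """Simulate auto_increment_offset / auto_increment_increment for one shard."""
    key = shard_index + 1          # auto_increment_offset (1-based)
    while True:
        yield key
        key += num_shards          # auto_increment_increment

# With 2 shards, the generated keys interleave and never collide.
shard0 = list(islice(shard_key_stream(0, 2), 5))  # [1, 3, 5, 7, 9]
shard1 = list(islice(shard_key_stream(1, 2), 5))  # [2, 4, 6, 8, 10]
assert not set(shard0) & set(shard1)
```

The trade-off is the same as with fixed offsets: the shard count is baked into the key scheme, so adding shards later needs planning.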

There isn't any way to convince MySQL to use a non-primary-key column for the clustered index. If you don't care about using SQLAlchemy to manage your database schema (although you probably should), you can simply set the UUID as the primary key in the SQLAlchemy schema and leave added_id as the primary key in the actual table.

I've also seen alternate solutions that simply use an external server (e.g. redis) to maintain the row id.

asked Oct 31 '12 by prschmid



1 Answer

Yes, you can specify any of the table's columns as the primary key for the purposes of the mapping using the "primary_key" mapper argument, which is a list of Column objects or a single Column:

from sqlalchemy import Column, Integer, Binary
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

# The model for table 'foo'
class Foo(Base):
    __tablename__ = 'foo'
    added_id = Column(Integer, primary_key=True, nullable=False)
    id = Column(Binary, index=True, unique=True, nullable=False)

    __mapper_args__ = {'primary_key': id}

Above, while SQLAlchemy Core will treat "added_id" as the "autoincrement" column, the mapper will be mostly uninterested in it, instead using "id" as the column it considers when determining the "identity" of the object.
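To see why the identity key matters, here is a rough stdlib-only sketch of what a session's identity map does (a hypothetical simplification, not SQLAlchemy's actual implementation): loaded objects are cached by (mapped class, primary-key value), so keying on the globally unique UUID rather than the per-shard added_id is what prevents rows from different shards from colliding.

```python
# A toy identity map: instances are cached keyed by (class, primary key).
identity_map = {}

class Foo:
    def __init__(self, added_id, uuid_bytes):
        self.added_id = added_id   # per-shard autoincrement value
        self.id = uuid_bytes       # globally unique 16-byte UUID

def get(cls, pk_value, loader):
    """Return the cached instance for (cls, pk_value), loading it only once."""
    key = (cls, pk_value)
    if key not in identity_map:
        identity_map[key] = loader()
    return identity_map[key]

# Two rows from two different shards, both with added_id == 1.
row_a = Foo(1, b'\x00' * 16)
row_b = Foo(1, b'\xff' * 16)

# Keyed by added_id, the second row wrongly resolves to the first:
same = get(Foo, 1, lambda: row_a) is get(Foo, 1, lambda: row_b)

# Keyed by the UUID, each shard's row keeps its own identity:
distinct = get(Foo, row_a.id, lambda: row_a) is not get(Foo, row_b.id, lambda: row_b)
assert same and distinct
```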

See the documentation for mapper() for more description.

answered Sep 22 '22 by zzzeek