
SQLAlchemy, UUIDs, Sharding, and AUTO_INCREMENT primary key... how to get them to work together?

I have a question pertaining to SQLAlchemy, database sharding, and UUIDs for you fine folks.

I'm currently using MySQL in which I have a table of the form:

CREATE TABLE foo (
    added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    id BINARY(16) NOT NULL,
    ... other stuff ...
    UNIQUE KEY(id)
);

A little background on this table: I never care about 'added_id'; I'm only using it to ensure that inserted rows are clustered together on disk (since the B-tree used to store the table in MySQL uses the primary key as the clustered index). The 'id' column contains the binary representation of a UUID -- this is the column I actually care about, and everything else references this ID. Again, I don't want the UUID to be the primary key, since the UUID is random, which would give the B-tree index horrible IO characteristics (at least that is what has been said elsewhere). Also, although UUID1 includes a timestamp to ensure that IDs are generated in "sequential" order, the inclusion of the MAC address in the ID makes it something I'd rather avoid. Thus, I'd like to use UUID4s.
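For reference, converting a UUID4 to and from the 16-byte form stored in a BINARY(16) column needs nothing beyond the standard library (a minimal sketch, independent of any particular schema):

```python
import uuid

# Generate a random (version 4) UUID and take its 16-byte representation,
# which is what a BINARY(16) column stores.
new_id = uuid.uuid4()
id_bytes = new_id.bytes
assert len(id_bytes) == 16

# Reconstruct the UUID from the bytes read back out of the database.
round_tripped = uuid.UUID(bytes=id_bytes)
assert round_tripped == new_id
```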

Ok, now moving on to the SQLAlchemy part. In SQLAlchemy one can define a model using their ORM for the above table by doing something like:

from sqlalchemy import Column, Integer, Binary
from sqlalchemy.ext.declarative import declarative_base

# The SQLAlchemy ORM base class
Base = declarative_base()

# The model for table 'foo'
class Foo(Base):
    __tablename__ = 'foo'
    added_id = Column(Integer, primary_key=True, nullable=False)
    id = Column(Binary, index=True, unique=True, nullable=False)
    ...

Again, this is basically the same as the SQL above.

And now to the question. Let's say this database is going to be sharded (horizontally partitioned) into 2 (or more) separate databases. Assuming no deletions, each of these databases will have records with added_id of 1, 2, 3, etc. in table foo. Since SQLAlchemy uses a session to manage the objects being worked on, and each object is identified only by its primary key, it seems possible that I could end up trying to access two Foo objects from two different shards with the same added_id, resulting in a conflict in the managed session.

Has anyone run into this issue? What have you done to solve it? Or, more likely, am I missing something in the SQLAlchemy documentation that ensures this cannot happen? However, looking at the sharding example provided with the SQLAlchemy download (examples/sharding/attribute_shard.py), it seems to side-step this issue by designating one of the database shards as an ID generator, creating an implicit bottleneck since all INSERTs have to go to that single database to get an ID. (It also mentions using UUIDs, but apparently those cause the performance issue for the indexes.)

Alternatively, is there a way to set the UUID as the primary key while still having the data clustered on disk by added_id? If it's not possible in MySQL, is it possible in another DB like Postgres?

Thanks in advance for any and all input!

--- UPDATE --- I just want to add an out-of-band answer that I received to this question. The following text isn't something I wrote; I'm including it here in case someone finds it useful.

The easiest way to avoid that situation with MySQL and auto increment keys is to use different auto increment offsets for each database, e.g.:

ALTER TABLE foo AUTO_INCREMENT=100000;

The downside is that you need to take care in how you configure each shard, and you need to plan a bit with respect to the total number of shards you use.
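Besides fixed starting ranges, MySQL also offers the `auto_increment_increment` and `auto_increment_offset` server variables, which interleave the keys instead of partitioning them into blocks. The effect can be sketched in plain Python (a hypothetical simulation, not MySQL itself): each shard starts at its own offset and steps by the total shard count, so no two shards ever produce the same key.

```python
from itertools import islice

def shard_key_stream(shard_index, num_shards):
    """Simulate auto_increment_offset / auto_increment_increment for one shard."""
    key = shard_index + 1          # auto_increment_offset (1-based)
    while True:
        yield key
        key += num_shards          # auto_increment_increment

# With 2 shards, the generated keys interleave and never collide.
shard0 = list(islice(shard_key_stream(0, 2), 5))  # [1, 3, 5, 7, 9]
shard1 = list(islice(shard_key_stream(1, 2), 5))  # [2, 4, 6, 8, 10]
assert not set(shard0) & set(shard1)
```

The trade-off is the same as with fixed offsets: the shard count is baked into the key scheme, so adding shards later needs planning.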

There isn't any way to convince MySQL to use a non-primary-key column for the clustered index. If you don't care about using SQLAlchemy to manage your database schema (although you probably should), you can simply set the UUID as the primary key in the SQLAlchemy schema and leave added_id as the primary key in the actual table.

I've also seen alternate solutions that simply use an external server (e.g. redis) to maintain the row id.

asked Oct 31 '12 by prschmid



1 Answer

Yes, you can specify any of the table's columns as the primary key for the purposes of the mapping using the "primary_key" mapper argument, which is a list of Column objects or a single Column:

from sqlalchemy import Column, Integer, Binary
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

# The model for table 'foo'
class Foo(Base):
    __tablename__ = 'foo'
    added_id = Column(Integer, primary_key=True, nullable=False)
    id = Column(Binary, index=True, unique=True, nullable=False)

    __mapper_args__ = {'primary_key': id}

Above, while SQLAlchemy Core will treat "added_id" as the "autoincrement" column, the mapper will be mostly uninterested in it, instead using "id" as the column it considers when determining the "identity" of the object.
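To see why the identity key matters, here is a rough stdlib-only sketch of what a session's identity map does (a hypothetical simplification, not SQLAlchemy's actual implementation): loaded objects are cached by (mapped class, primary-key value), so keying on the globally unique UUID rather than the per-shard added_id is what prevents rows from different shards from colliding.

```python
# A toy identity map: instances are cached keyed by (class, primary key).
identity_map = {}

class Foo:
    def __init__(self, added_id, uuid_bytes):
        self.added_id = added_id   # per-shard autoincrement value
        self.id = uuid_bytes       # globally unique 16-byte UUID

def get(cls, pk_value, loader):
    """Return the cached instance for (cls, pk_value), loading it only once."""
    key = (cls, pk_value)
    if key not in identity_map:
        identity_map[key] = loader()
    return identity_map[key]

# Two rows from two different shards, both with added_id == 1.
row_a = Foo(1, b'\x00' * 16)
row_b = Foo(1, b'\xff' * 16)

# Keyed by added_id, the second row wrongly resolves to the first:
same = get(Foo, 1, lambda: row_a) is get(Foo, 1, lambda: row_b)

# Keyed by the UUID, each shard's row keeps its own identity:
distinct = get(Foo, row_a.id, lambda: row_a) is not get(Foo, row_b.id, lambda: row_b)
assert same and distinct
```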

See the documentation for mapper() for more description.

answered Sep 22 '22 by zzzeek