Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data Structure for storing a sorting field to efficiently allow modifications

I'm using Django and PostgreSQL, but I'm not absolutely tied to the Django ORM if there's a better way to do this with raw SQL or database specific operations.

I've got a model that needs sequential ordering. Lookup operations will generally retrieve the entire list in order. The most common operation on this data is to move a row to the bottom of a list, with a subset of the intervening items bubbling up to replace the previous item like this:

(operation on A, with subset B, C, E)

A -> B
B -> C
C -> E
D -> D
E -> A

Notice how D does not move.

In general, the subset of items will not be more than about 50 items, but the base list may grow to tens of thousands of entries.

The most obvious way of implementing this is with a simple integer order field. This seems suboptimal. It requires the compromise of making the position ordering column non-unique, where non-uniqueness is only required for the duration of a modification operation. To see this, imagine the minimal operation using A with subset B:

oldpos = B.pos
B.pos = A.pos
A.pos = oldpos

Even though you've stored the position, at the second line you've violated the uniqueness constraint. Additionally, this method makes atomicity problematic - your read operation has to happen before the write, during which time your records could change. Django's default transaction handling documentation doesn't address this, though I know it should be possible in the SQL using the "REPEATABLE READ" level of transaction locking.

I'm looking for alternate data structures that suit this use pattern more closely. I've looked at this question for ideas.

One proposal there is the Dewey decimal style solution, which makes insert operations occur numerically between existing values, so inserting A between B and C results in:

A=1   ->   B=2
B=2   ->   A=2.5
C=3   ->   C=3

This solves the column uniqueness problem, but introduces the issue that the column must be a float of a specified number of decimals. Either I over-estimate, and store way more data than I need to, or the system becomes limited by whatever arbitrary decimal length I impose. Furthermore, I don't expect use to be even over the database - some keys are going to be moved far more often than others, making this solution hit the limit sooner. I could solve this problem by periodically re-numbering the database, but it seems that a good data structure should avoid needing this.

Another structure I've considered is the linked list (and variants). This has the advantage of making modification straightforward, but I'm not certain of it's properties with respect to SQL - ordering such a list in the SQL query seems like it would be painful, and extracting a non-sequential subset of the list has terrible retrieval properties.

Beyond this, there are B-Trees, various Binary Trees, and so on. What do you recommend for this data structure? Is there a standard data structure for this solution in SQL? Is the initial idea of going with sequential integers really going to have scaling issues, or am I seeing problems where there are none?

like image 897
Paul McMillan Avatar asked Oct 28 '09 22:10

Paul McMillan


1 Answers

Prefered solutions:

A linked list would be the usual way to achieve this. A query to return the items in order is trivial in Oracle, but Im not sure how you would do it in PostreSQL.

Another option would be to implement this using the ltree module for postgresql.

Less graceful (and write-heavy) solution: Start transaction. "select for update" within scope for row level locks. Move the target record to position 0, update the targets future succeeding records to +1 where their position is higher than the targets original position (or vice versa) and then update the target to the new position - a single additional write over that needed without a unique constraint. Commit :D

Simple (yet still write-heavy) solution if you can wait for Postgresql 8.5 (Alpha is available) :)

Wrap it in a transaction, select for update in scope, and use a deferred constraint (postgresql 8.5 has support for deferred unique constraints like Oracle).

like image 199
Matt Avatar answered Sep 22 '22 07:09

Matt