 

How do databases perform atomic I/O?

Databases like Oracle and SQL Server are very good at data integrity. If I wanted to write a data store that I knew would either store some data or fail entirely (i.e. be ACID), I would use a database like MySQL underneath as the actual store, because those problems are already solved.

However, as a non-comp-sci graduate, I'm left wondering how ACID actually works at a very low level. I know that Oracle, for example, writes data to "online redo logs" all the time, then performs a "commit" at some point when the application signals that the transaction should be committed.

It's this "commit" stage that I want to zoom right in on and understand. Is it a case of just writing "one more byte" to disk, or flipping a 0 to a 1 to say a given row has been stored successfully?

asked Feb 09 '12 by Neil Barnwell



1 Answer

I remember once finding the BerkeleyDB docs actually very useful for getting an idea of how these implementations can work, because it is/was quite a low-level database that implemented transactions without the whole relational/query-planning infrastructure.

Not all databases (even just the ones you mention) work in quite the same way. PostgreSQL's bottom-level implementation is quite different from both Oracle's and SQL Server's snapshot implementations, even though they are all based on the same approach (MVCC: multi-version concurrency control).

One way to implement the ACID properties is to write all the changes that you ("you" here is some transaction making changes) make to the database into a "transaction log", as well as locking each row (the unit of atomicity) to ensure no other transaction can mutate it until you have either committed or rolled back. At the end of the transaction, if committing, you simply write a record into the log saying you committed, and release the locks. If you roll back, you need to walk back through the transaction log undoing all your changes, so each change written to the log file contains a "before image" of how the data looked originally. (In practice it will also contain an "after image", because transaction logs are replayed for crash recovery too.) By locking each row you are changing, concurrent transactions do not see your changes until you release the locks after ending the transaction.
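As a very rough sketch of that scheme (illustrative only; every name here is invented, and locking is reduced to a single-transaction toy), a log-based store might look like this:

    # Toy undo/redo log: each record carries a before image (for rollback)
    # and an after image (so the log could also be replayed after a crash).
    class ToyStore:
        def __init__(self):
            self.rows = {}     # row id -> value
            self.log = []      # append-only "transaction log"

        def begin(self):
            self.changes = []  # records written by the current transaction

        def update(self, row_id, new_value):
            record = {"row": row_id,
                      "before": self.rows.get(row_id),  # before image
                      "after": new_value}               # after image
            self.log.append(record)
            self.changes.append(record)
            self.rows[row_id] = new_value

        def commit(self):
            # Committing really is just "one more record" in the log.
            self.log.append({"commit": True})

        def rollback(self):
            # Walk back through our changes, restoring the before images.
            for record in reversed(self.changes):
                if record["before"] is None:
                    del self.rows[record["row"]]
                else:
                    self.rows[record["row"]] = record["before"]
            self.log.append({"rollback": True})

The point of the commit record is durability ordering: once that one record reaches disk, crash recovery will replay the after images; if it never reaches disk, recovery undoes the before images instead. That is essentially the "one more byte" the question asks about.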

MVCC is a method by which concurrent transactions that want to read rows, rather than being blocked by you updating, can access the "before image" instead. Every transaction has an identity and a way of determining which transactions' data it can "see" and which it can't: different rules for producing this set are used to implement different isolation levels. So to get "repeatable read" semantics, a transaction must find the "before image" for any row that was updated by a transaction started after it, for example. You could naively implement this by having transactions look back through the transaction log for the before images, but in practice they are stored somewhere else: hence Oracle has separate redo and undo spaces (redo is the transaction log; undo holds before images of blocks for concurrent transactions to use), while SQL Server stores the before images in tempdb. By contrast, PostgreSQL always creates a new copy of a row whenever it is updated, so the before images live in the data blocks themselves: this has some advantages (commit and rollback are both very simple operations, and there is no additional space to manage) with trade-offs (those outdated row versions have to be vacuumed up in the background).
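To make "which transactions' data can I see" concrete, here is a hedged sketch of a snapshot, loosely modelled on how PostgreSQL behaves (all names here are invented):

    # A snapshot freezes the set of visible transaction IDs at one moment:
    # anything that started later, or was still running, is invisible.
    class Snapshot:
        def __init__(self, next_txid, in_progress):
            self.next_txid = next_txid           # txids >= this hadn't started
            self.in_progress = set(in_progress)  # started but not finished

        def can_see(self, txid):
            if txid >= self.next_txid:
                return False        # started after the snapshot was taken
            if txid in self.in_progress:
                return False        # still running at snapshot time
            return committed(txid)  # consult the commit-status log

    CLOG = {20: "committed"}        # stand-in for a pg_clog-style table

    def committed(txid):
        return CLOG.get(txid) == "committed"

The different isolation levels then fall out of when the snapshot is taken: once per transaction for "repeatable read", once per statement for "read committed".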

In PostgreSQL's case (and this is the DB whose internals I'm most familiar with), each row version on disk has some additional properties that transactions must examine to decide whether that row version is "visible" to them. For simplicity, consider that they have "xmin" and "xmax": "xmin" specifies the ID of the transaction that created the row version, and "xmax" the (optional) ID of the transaction that deleted it (where "deleting" may mean creating a new row version to represent an update to the row). So you start with a row created by txn#20:

xmin xmax id value
20   -    1  FOO

and then txn#25 performs update t set value = 'BAR' where id = 1

xmin xmax id value
20   25   1  FOO
25   -    1  BAR

Until txn#25 is finished, new transactions will know to regard its changes as not visible. So a transaction scanning this table will take the "FOO" version, since its xmax is a non-visible transaction.

If txn#25 is rolled back, new transactions will not immediately skip it, but will consider whether txn#25 committed or rolled back. (PostgreSQL maintains a "commit status" lookup table, pg_clog, to serve this.) Since txn#25 rolled back, its changes are not visible, so again the "FOO" version is taken (and the "BAR" version is skipped, since its xmin transaction is not visible).

If txn#25 is committed, then the "FOO" row version is no longer taken, since its xmax transaction is visible (that is, the changes made by that transaction are now visible). By contrast, the "BAR" row version is taken, since its xmin transaction is visible (and it has no xmax).
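Putting those three cases together, a simplified visibility test (a sketch only; the real rules handle far more, e.g. subtransactions and a transaction seeing its own changes) would read:

    from collections import namedtuple

    Row = namedtuple("Row", "xmin xmax id value")

    def version_is_visible(row, snapshot):
        # snapshot.can_see(txid) is the check sketched earlier:
        # "had txid committed when my snapshot was taken?"
        if not snapshot.can_see(row.xmin):
            return False   # creator aborted or not yet visible: skip it
        if row.xmax is None:
            return True    # never deleted or updated
        if snapshot.can_see(row.xmax):
            return False   # a visible transaction deleted it: skip it
        return True        # deleter aborted or still in progress

    # While txn#25 is in progress, can_see(25) is False, so:
    #   Row(20, 25, 1, "FOO")   -> visible (its xmax cannot be "seen")
    #   Row(25, None, 1, "BAR") -> not visible (its xmin cannot be "seen")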

While txn#25 is still in progress (again, this can be read from pg_clog), any other transaction that wants to update the row will wait until txn#25 completes, by trying to take a shared lock on the transaction ID. I'm highlighting this point because it's why PostgreSQL doesn't usually have "row locks" as such, only transaction locks: there is no in-memory lock for each row changed. (Locking via select ... for update is done by setting xmax, plus a flag to indicate that this xmax signifies locking rather than deletion.)
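Here is a hedged sketch of that "lock the transaction, not the row" idea, using a Python event per transaction to stand in for the shared lock on the transaction ID (entirely illustrative):

    import threading

    txn_done = {}  # txid -> Event, set when that transaction completes

    def begin(txid):
        txn_done[txid] = threading.Event()

    def finish(txid):
        # Commit or rollback: this one operation wakes every queued
        # updater, with no per-row lock ever having existed.
        txn_done[txid].set()

    def wait_for_row(row):
        # A would-be updater that finds a live xmax on the row simply
        # blocks until that transaction is over, then re-checks visibility.
        if row.xmax is not None and row.xmax in txn_done:
            txn_done[row.xmax].wait()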

Oracle... does something somewhat similar, but my knowledge of the details is much hazier. In Oracle each transaction is issued a System Change Number (SCN), which is recorded in the top of each block. When a block changes, its original contents are put in the undo space, with the new block pointing at the old one: so you essentially have a linked list of versions of block N, with the latest version in the data file and progressively older versions in the undo tablespace. At the top of the block is a list of the "interested transactions", which somehow implements locking (again without an in-memory lock for each row changed), and I can't remember the details beyond that.
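If that description holds (again, this is from hazy memory), a consistent read of a block as of a given SCN is just a walk down that chain; something like this invented sketch:

    # Follow the chain from the current block back through undo until we
    # find a version old enough for our snapshot. Field names are made up.
    class BlockVersion:
        def __init__(self, scn, data, older=None):
            self.scn = scn      # SCN of the change that wrote this version
            self.data = data
            self.older = older  # previous version, preserved in undo

    def read_consistent(block, snapshot_scn):
        version = block         # newest version lives in the data file
        while version is not None and version.scn > snapshot_scn:
            version = version.older
        return version          # None: undo recycled ("snapshot too old")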

SQL Server's snapshot isolation mechanism is, I believe, largely similar to Oracle's, using tempdb to store blocks that are being changed rather than a dedicated file.

Hope this rambling answer was useful. It's all from memory, so large swathes of misinformation are possible, particularly for the non-PostgreSQL implementations.

  • http://devcenter.heroku.com/articles/postgresql-concurrency
  • https://gist.github.com/1781649 - runnable demo of MVCC on PostgreSQL
  • http://simpledbm.googlecode.com/files/mvcc-survey-1.0.pdf
answered Sep 19 '22 by araqnid