 

How do databases perform atomic I/O?

Databases like Oracle and SQL Server are very good at data integrity. If I wanted to write a data store that I knew would either store some data or fail entirely (i.e. be ACID), I would use a database like MySQL underneath as the actual store, because those problems are already solved.

However, as a non-comp-sci graduate, I'm left wondering how ACID actually works at a very low level. I know that Oracle, for example, writes data to "online redo logs" all the time, then performs a "commit" at some point when the application signals that the transaction should be committed.

It's this "commit" stage that I want to zoom right in on and understand. Is it a case of just writing "one more byte" to disk, or flipping a 0 to a 1 to say a given row has been stored successfully?

asked Feb 09 '12 by Neil Barnwell



1 Answer

I remember once finding the BerkeleyDB docs actually very useful for getting an idea of how these implementations can work, because it is/was quite a low-level database that implemented transactions without the whole relational/query-planning infrastructure.

Not all databases (even just the ones you mention) work in quite the same way. PostgreSQL's bottom-level implementation is quite different from both Oracle's and SQL Server's snapshot implementations, even though they are all based on the same approach (MVCC: multi-version concurrency control).

One way to implement the ACID properties is to write all the changes that you ("you" here is some transaction making changes) make to the database into a "transaction log", as well as locking each row (the unit of atomicity) to ensure no other transaction can mutate it until you have either committed or rolled back. At the end of the transaction, if committing, you simply write a record into the log saying you committed, and release the locks. If you roll back, you need to walk back through the transaction log undoing all your changes, so each change written to the log file contains a "before image" of how the data looked originally. (In practice it will also contain an "after image", because transaction logs are replayed for crash recovery too.) By locking each row you are changing, concurrent transactions do not see your changes until you release the locks after ending the transaction.
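As a very rough sketch of that scheme (illustrative only; every name here is invented, and locking is reduced to a single-transaction toy), a log-based store might look like this:

    # Toy undo/redo log: each record carries a before image (for rollback)
    # and an after image (so the log could also be replayed after a crash).
    class ToyStore:
        def __init__(self):
            self.rows = {}     # row id -> value
            self.log = []      # append-only "transaction log"

        def begin(self):
            self.changes = []  # records written by the current transaction

        def update(self, row_id, new_value):
            record = {"row": row_id,
                      "before": self.rows.get(row_id),  # before image
                      "after": new_value}               # after image
            self.log.append(record)
            self.changes.append(record)
            self.rows[row_id] = new_value

        def commit(self):
            # Committing really is just "one more record" in the log.
            self.log.append({"commit": True})

        def rollback(self):
            # Walk back through our changes, restoring the before images.
            for record in reversed(self.changes):
                if record["before"] is None:
                    del self.rows[record["row"]]
                else:
                    self.rows[record["row"]] = record["before"]
            self.log.append({"rollback": True})

The point of the commit record is durability ordering: once that one record reaches disk, crash recovery will replay the after images; if it never reaches disk, recovery undoes the before images instead. That is essentially the "one more byte" the question asks about.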

MVCC is a method by which concurrent transactions that want to read rows, rather than being blocked by you updating, can access the "before image" instead. Every transaction has an identity and a way of determining which transactions' data it can "see" and which it can't: different rules for producing this set are used to implement different isolation levels. So to get "repeatable read" semantics, a transaction must find the "before image" for any row that was updated by a transaction started after it, for example. You could naively implement this by having transactions look back through the transaction log for the before images, but in practice they are stored somewhere else: hence Oracle has separate redo and undo spaces (redo is the transaction log; undo holds before images of blocks for concurrent transactions to use), while SQL Server stores the before images in tempdb. By contrast, PostgreSQL always creates a new copy of a row whenever it is updated, so the before images live in the data blocks themselves: this has some advantages (commit and rollback are both very simple operations, and there is no additional space to manage) with trade-offs (those outdated row versions have to be vacuumed up in the background).
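To make "which transactions' data can I see" concrete, here is a hedged sketch of a snapshot, loosely modelled on how PostgreSQL behaves (all names here are invented):

    # A snapshot freezes the set of visible transaction IDs at one moment:
    # anything that started later, or was still running, is invisible.
    class Snapshot:
        def __init__(self, next_txid, in_progress):
            self.next_txid = next_txid           # txids >= this hadn't started
            self.in_progress = set(in_progress)  # started but not finished

        def can_see(self, txid):
            if txid >= self.next_txid:
                return False        # started after the snapshot was taken
            if txid in self.in_progress:
                return False        # still running at snapshot time
            return committed(txid)  # consult the commit-status log

    CLOG = {20: "committed"}        # stand-in for a pg_clog-style table

    def committed(txid):
        return CLOG.get(txid) == "committed"

The different isolation levels then fall out of when the snapshot is taken: once per transaction for "repeatable read", once per statement for "read committed".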

In PostgreSQL's case (and this is the DB whose internals I'm most familiar with), each row version on disk has some additional properties that transactions must examine to decide whether that row version is "visible" to them. For simplicity, consider that they have "xmin" and "xmax": "xmin" specifies the ID of the transaction that created the row version, and "xmax" the (optional) ID of the transaction that deleted it (where "deleting" may mean creating a new row version to represent an update to the row). So you start with a row created by txn#20:

xmin xmax id value
20   -    1  FOO

and then txn#25 performs update t set value = 'BAR' where id = 1

xmin xmax id value
20   25   1  FOO
25   -    1  BAR

Until txn#25 is finished, new transactions will know to regard its changes as not visible. So a transaction scanning this table will take the "FOO" version, since its xmax is a non-visible transaction.

If txn#25 is rolled back, new transactions will not immediately skip it, but will consider whether txn#25 committed or rolled back. (PostgreSQL maintains a "commit status" lookup table, pg_clog, to serve this.) Since txn#25 rolled back, its changes are not visible, so again the "FOO" version is taken (and the "BAR" version is skipped, since its xmin transaction is not visible).

If txn#25 is committed, then the "FOO" row version is no longer taken, since its xmax transaction is visible (that is, the changes made by that transaction are now visible). By contrast, the "BAR" row version is taken, since its xmin transaction is visible (and it has no xmax).
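Putting those three cases together, a simplified visibility test (a sketch only; the real rules handle far more, e.g. subtransactions and a transaction seeing its own changes) would read:

    from collections import namedtuple

    Row = namedtuple("Row", "xmin xmax id value")

    def version_is_visible(row, snapshot):
        # snapshot.can_see(txid) is the check sketched earlier:
        # "had txid committed when my snapshot was taken?"
        if not snapshot.can_see(row.xmin):
            return False   # creator aborted or not yet visible: skip it
        if row.xmax is None:
            return True    # never deleted or updated
        if snapshot.can_see(row.xmax):
            return False   # a visible transaction deleted it: skip it
        return True        # deleter aborted or still in progress

    # While txn#25 is in progress, can_see(25) is False, so:
    #   Row(20, 25, 1, "FOO")   -> visible (its xmax cannot be "seen")
    #   Row(25, None, 1, "BAR") -> not visible (its xmin cannot be "seen")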

While txn#25 is still in progress (again, this can be read from pg_clog), any other transaction that wants to update the row will wait until txn#25 completes, by trying to take a shared lock on the transaction ID. I'm highlighting this point because it's why PostgreSQL doesn't usually have "row locks" as such, only transaction locks: there is no in-memory lock for each row changed. (Locking via select ... for update is done by setting xmax, plus a flag to indicate that this xmax signifies locking rather than deletion.)
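Here is a hedged sketch of that "lock the transaction, not the row" idea, using a Python event per transaction to stand in for the shared lock on the transaction ID (entirely illustrative):

    import threading

    txn_done = {}  # txid -> Event, set when that transaction completes

    def begin(txid):
        txn_done[txid] = threading.Event()

    def finish(txid):
        # Commit or rollback: this one operation wakes every queued
        # updater, with no per-row lock ever having existed.
        txn_done[txid].set()

    def wait_for_row(row):
        # A would-be updater that finds a live xmax on the row simply
        # blocks until that transaction is over, then re-checks visibility.
        if row.xmax is not None and row.xmax in txn_done:
            txn_done[row.xmax].wait()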

Oracle... does something somewhat similar, but my knowledge of the details is much hazier. In Oracle each transaction is issued a System Change Number (SCN), which is recorded in the top of each block. When a block changes, its original contents are put in the undo space, with the new block pointing at the old one: so you essentially have a linked list of versions of block N, with the latest version in the data file and progressively older versions in the undo tablespace. At the top of the block is a list of the "interested transactions", which somehow implements locking (again without an in-memory lock for each row changed), and I can't remember the details beyond that.
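If that description holds (again, this is from hazy memory), a consistent read of a block as of a given SCN is just a walk down that chain; something like this invented sketch:

    # Follow the chain from the current block back through undo until we
    # find a version old enough for our snapshot. Field names are made up.
    class BlockVersion:
        def __init__(self, scn, data, older=None):
            self.scn = scn      # SCN of the change that wrote this version
            self.data = data
            self.older = older  # previous version, preserved in undo

    def read_consistent(block, snapshot_scn):
        version = block         # newest version lives in the data file
        while version is not None and version.scn > snapshot_scn:
            version = version.older
        return version          # None: undo recycled ("snapshot too old")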

SQL Server's snapshot isolation mechanism is, I believe, largely similar to Oracle's, using tempdb to store blocks that are being changed rather than a dedicated file.

Hope this rambling answer was useful. It's all from memory, so large swathes of misinformation are possible, particularly for the non-PostgreSQL implementations.

  • http://devcenter.heroku.com/articles/postgresql-concurrency
  • https://gist.github.com/1781649 - runnable demo of MVCC on PostgreSQL
  • http://simpledbm.googlecode.com/files/mvcc-survey-1.0.pdf
answered Sep 19 '22 by araqnid