The WAL (Write-Ahead Log) is a technique used in many systems.
The mechanism of a WAL is that when a client writes data, the system does two things:
1. Append the mutation to a log file on disk (the WAL).
2. Update the data itself in memory (the cache), to be written to its place on disk later.
There are two benefits:
1. If the system crashes, the data can be recovered by replaying the log, so no acknowledged write is lost.
2. The client gets a reliable response: success is reported only after the mutation is safely on disk.
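To make those two steps concrete, here is a minimal sketch in Go. It is a toy store with an illustrative length-prefixed record format; `Store`, `Put`, and `encodeRecord` are invented names, not any real system's API:

```go
package wal

import (
	"encoding/binary"
	"os"
)

// Store is a toy key-value store: an append-only WAL plus an in-memory map.
type Store struct {
	wal  *os.File
	data map[string]string
}

// Open opens (or creates) the log file and starts with an empty cache.
func Open(path string) (*Store, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &Store{wal: f, data: make(map[string]string)}, nil
}

// Put performs the two steps from the question: (1) append the mutation to
// the log and force it to disk, (2) update the in-memory copy of the data.
func (s *Store) Put(key, value string) error {
	if _, err := s.wal.Write(encodeRecord(key, value)); err != nil {
		return err
	}
	// Only after Sync returns is the mutation durable, so only now may
	// we report success to the client.
	if err := s.wal.Sync(); err != nil {
		return err
	}
	s.data[key] = value // step 2: cheap in-memory update
	return nil
}

// encodeRecord frames key and value with length prefixes so the log can be
// replayed unambiguously after a crash.
func encodeRecord(key, value string) []byte {
	buf := make([]byte, 8+len(key)+len(value))
	binary.BigEndian.PutUint32(buf[0:4], uint32(len(key)))
	binary.BigEndian.PutUint32(buf[4:8], uint32(len(value)))
	copy(buf[8:], key)
	copy(buf[8+len(key):], value)
	return buf
}
```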
Why not just write the data to disk directly? Make every write go straight to disk: on success, tell the client it succeeded; if the write fails, return a failure response or a timeout.
In this way, you still have those two benefits.
So what is the advantage of using a WAL?
Performance.
Step two in your list is optional. For busy records, a value may be updated again before it ever makes it out of the cache and onto the disk. Those data writes never need to be performed at all; only the log writes are needed for possible recovery.
Log writes can be batched into larger, sequential writes. For busy workloads, delaying individual log writes slightly and then performing them as a single combined write can significantly improve throughput, as sketched below.
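A sketch of what such batching ("group commit") might look like, assuming a single writer goroutine that drains a queue of pending records; the `request` and `groupCommit` names are illustrative, not a real library's API:

```go
package wal

import "os"

// request pairs a pre-encoded log record with a channel on which the
// writer reports the outcome to the waiting client.
type request struct {
	rec  []byte
	done chan error
}

// groupCommit is run by a single goroutine. It drains queued requests,
// concatenates their records, and pays for one write and one fsync on
// behalf of the whole batch.
func groupCommit(f *os.File, pending chan request) {
	for first := range pending {
		batch := []request{first}
		buf := append([]byte(nil), first.rec...)
	drain: // collect whatever else is already queued, without blocking
		for {
			select {
			case r := <-pending:
				batch = append(batch, r)
				buf = append(buf, r.rec...)
			default:
				break drain
			}
		}
		_, err := f.Write(buf) // one large sequential write
		if err == nil {
			err = f.Sync() // one fsync amortized over the batch
		}
		for _, r := range batch {
			r.done <- err // acknowledge every caller in the batch
		}
	}
}
```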
This was much more important when spinning disks were the standard technology, because seek times and rotational latency were a big issue: both are part of the physical process of getting the right part of the disk under the read/write head. With SSDs those considerations matter far less, but avoiding some writes entirely, and keeping the remaining writes large and sequential, still helps.
Update:
SSDs also perform better with large sequential writes, but for different reasons. It is not as simple as saying "no seek time or rotational latency, therefore just write randomly". For example, writing large blocks into space the SSD knows is free (e.g. via the TRIM command) is better than read-modify-write, where the drive also has to manage wear levelling and potentially map updates onto different internal block sizes.
As you note, a key contribution of a WAL is durability. After a mutation has been committed to the WAL, you can return to the caller, because even if the system crashes, the mutation is never lost.
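Crash recovery is then just a matter of replaying the log from the beginning. A sketch, assuming the length-prefixed record format from the earlier toy example (a torn final record from the crash is simply skipped):

```go
package wal

import (
	"encoding/binary"
	"os"
)

// Replay scans a WAL from the beginning and reapplies every complete
// record into data, rebuilding the state that existed before the crash.
func Replay(path string, data map[string]string) error {
	buf, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	for len(buf) >= 8 {
		klen := binary.BigEndian.Uint32(buf[0:4])
		vlen := binary.BigEndian.Uint32(buf[4:8])
		if uint32(len(buf)) < 8+klen+vlen {
			break // torn final record from the crash: ignore it
		}
		data[string(buf[8:8+klen])] = string(buf[8+klen : 8+klen+vlen])
		buf = buf[8+klen+vlen:]
	}
	return nil
}
```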
If you write the update directly to disk, there are two options:
1) Write each mutation to disk in the order it arrives (an unsorted, append-style layout).
2) Keep the data on disk sorted (or otherwise indexed) by key.
If you go with 1), the cost of a read is O(mutations), because a lookup has to scan the whole history, so pretty much every system uses 2). RocksDB uses an LSM tree, whose files are internally sorted by key. For that reason, "directly writing to disk" means you may have to rewrite every record that comes after the current key. That is too expensive, so instead you append the mutation to the WAL and apply it to a sorted in-memory structure (the memtable), deferring the expensive on-disk rewrite.
Because the memtables and the files on disk are sorted, read accesses are still reasonably fast. Updating the sorted structure in memory is easy because it is just a balanced tree. When you flush the memtable to disk and/or run a compaction, you rewrite the file structures to reflect many accumulated writes at once, which makes each individual write substantially cheaper.
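As a rough sketch of that flush step, under the same toy assumptions as before (a real memtable is a balanced tree or skip list, and a real SSTable is a binary format with index and filter blocks):

```go
package wal

import (
	"bufio"
	"fmt"
	"os"
	"sort"
)

// flushMemtable writes the buffered updates out as one sorted file in a
// single sequential pass: many individual writes are amortized into one
// rewrite of the on-disk structure.
func flushMemtable(memtable map[string]string, path string) error {
	keys := make([]string, 0, len(memtable))
	for k := range memtable {
		keys = append(keys, k)
	}
	sort.Strings(keys) // keep the on-disk file sorted by key

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := bufio.NewWriter(f)
	for _, k := range keys {
		// One "key<TAB>value" line per entry; a real SSTable uses a
		// binary format plus index blocks.
		if _, err := fmt.Fprintf(w, "%s\t%s\n", k, memtable[k]); err != nil {
			return err
		}
	}
	if err := w.Flush(); err != nil {
		return err
	}
	return f.Sync()
}
```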