At first glance it seems like a good idea to let the hard disk write to RAM on its own, without CPU instructions copying data, particularly with the success of asynchronous networking in mind. But the Wikipedia article on Direct Memory Access (DMA) states this:
With DMA, the CPU gets freed from this overhead and can do useful tasks during data transfer (though the CPU bus would be partly blocked by DMA).
I don't understand how a bus line can be "partly blocked". Presumably memory can be accessed by only one device at a time, so it seems there is little useful work the CPU can actually do. It would be blocked on its first attempt to read uncached memory, which I expect happens very quickly with a 2 MB cache.
The goal of freeing up the CPU to do other tasks seems gratuitous. Does hard disk DMA foster any performance increase in practice?
1: PIO (programmed IO) thrashes the CPU caches. The data read from the disk will, most of the time, not be processed immediately afterwards. Data is often read in large chunks by the application, but PIO is done in smaller blocks (typically 64K IIRC). So the data-reading application will wait until the large chunk has been transferred, and will not benefit from the smaller blocks being in the cache just after they have been fetched from the controller. Meanwhile other applications will suffer from large parts of the cache being evicted by the transfer. This could probably be avoided by using special instructions that tell the CPU not to cache the data but to write it "directly" to main memory. However, I'm pretty certain this would slow down the copy loop and thereby hurt even more than the cache thrashing.
2: PIO, as it's implemented on x86 systems, and probably most other systems, is really slow compared to DMA. The problem is not that the CPU wouldn't be fast enough. The problem stems from the way the bus and the disk controller's PIO modes are designed. If I'm not mistaken, the CPU has to read every byte (or every DWORD when using 32 bit PIO modes) from a so-called IO port. That means for every DWORD of data, the port's address has to be put on the bus, and the controller must respond by putting the data DWORD on the bus. Whereas when using DMA, the controller can transfer bursts of data, utilizing the full bandwidth of the bus and/or memory controller. Of course there is much room for optimizing this legacy PIO design. DMA transfers are such an optimization. Other solutions that would still be considered PIO might be possible too, but then again they would still suffer from other problems (e.g. the cache thrashing mentioned above).
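To put rough numbers on the per-DWORD overhead described in point 2, here is a toy count of bus transactions for one 512-byte sector. The 4-byte width corresponds to the 32-bit PIO mode mentioned there; the burst length of 16 DWORDs is an assumption picked for illustration, not a property of any particular controller, and `bus_transactions` is a made-up helper name.

```python
def bus_transactions(total_bytes, bytes_per_transaction):
    """Number of separate bus transactions needed (ceiling division)."""
    return -(-total_bytes // bytes_per_transaction)

SECTOR_BYTES = 512   # one classic disk sector
PIO_WIDTH = 4        # 32-bit PIO: one DWORD per port read
BURST_DWORDS = 16    # assumed DMA burst length, illustrative only

pio = bus_transactions(SECTOR_BYTES, PIO_WIDTH)
dma = bus_transactions(SECTOR_BYTES, PIO_WIDTH * BURST_DWORDS)
print(pio, dma)  # 128 8
```

Each of the 128 PIO transactions also costs the CPU an instruction round trip, which this count does not even include.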
3: Memory- and/or bus-bandwidth is not the limiting factor for most applications, so the DMA transfer will not stall anything. It might slow some applications down a little, but usually it should be hardly noticeable. After all disks are rather slow compared with the bandwidth of the bus and/or memory controller. A "disk" (SSD, RAID array) that delivers > 500 MB/s is really fast. A bus or memory subsystem that cannot at least deliver 10 times that number must be from the stone ages. OTOH PIO really stalls the CPU completely while it's transferring a block of data.
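The arithmetic behind point 3 is short enough to spell out. This sketch takes the 500 MB/s disk figure and the "10 times that number" memory figure from the text; `bandwidth_share` is just a hypothetical helper name.

```python
def bandwidth_share(device_mb_per_s, memory_mb_per_s):
    """Fraction of memory bandwidth a device uses when streaming at full speed."""
    return device_mb_per_s / memory_mb_per_s

disk = 500           # MB/s, the "really fast" disk from the text
memory = 10 * disk   # MB/s, the "10 times that number" lower bound

share = bandwidth_share(disk, memory)
print(f"{share:.0%}")  # 10%
```

So even in that pessimistic case, about 90% of the memory bandwidth is still left for the CPU while the disk streams at full speed.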
I don't know if I'm missing anything.
Let's suppose we don't have a DMA controller. Every transfer from a "slow" device to memory would then be a loop on the CPU:
ask_for_a_block_to_device
wait_until_device_answer (or change_task_and_be_interrupted_when_ready)
write_to_memory
So the CPU would have to write to memory itself, chunk by chunk.
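A minimal simulation of that loop, assuming a made-up `FakeDisk` class that needs a few polls before each block is ready (real hardware would be driven through I/O ports, not a Python object). It only shows where the CPU's time goes in the polling variant:

```python
class FakeDisk:
    """Hypothetical slow device: a block is ready only after a few polls."""

    def __init__(self, blocks, polls_until_ready=3):
        self._blocks = list(blocks)
        self._polls_until_ready = polls_until_ready
        self._countdown = 0
        self._current = None

    def request_block(self):
        self._current = self._blocks.pop(0)
        self._countdown = self._polls_until_ready

    def ready(self):
        if self._countdown > 0:
            self._countdown -= 1
            return False
        return True

    def read_block(self):
        return self._current


def pio_read(disk, n_blocks):
    """Copy n_blocks into 'memory' with the CPU doing every step itself."""
    memory = []
    busy_polls = 0
    for _ in range(n_blocks):
        disk.request_block()              # ask_for_a_block_to_device
        while not disk.ready():           # wait_until_device_answer
            busy_polls += 1               # CPU time spent doing nothing useful
        memory.extend(disk.read_block())  # write_to_memory
    return memory, busy_polls


disk = FakeDisk([b"ab", b"cd"], polls_until_ready=3)
memory, busy_polls = pio_read(disk, 2)
print(bytes(memory), busy_polls)  # b'abcd' 6
```

Both the polling and the final copy run on the CPU here, which is exactly the work a DMA engine takes over.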
Is a CPU necessary for doing memory transfers? No. We can use another device (or a mechanism such as DMA bus mastering) to transfer data to/from memory.
Meanwhile the CPU can be doing something else: working out of its cache, or even accessing other parts of memory for a large share of the time.
This is the crucial part: data is not being transferred 100% of the time, because the other device is very slow (compared to memory and the CPU).
Here is an attempt to represent shared memory bus usage (C when accessed by the CPU, D when accessed by DMA):
Memory Bus ----CCCCCCCC---D----CCCCCCCCCDCCCCCCCCC----D
As you can see, memory is accessed by one device at a time: sometimes by the CPU, sometimes by the DMA controller, and only rarely by the latter.
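A timeline like the one above can be generated with a small toy model. The assumptions are mine: the DMA device wants the bus once every `dma_period` cycles and always wins a conflict, and a stalled CPU access is only counted, not retried.

```python
def bus_timeline(cpu_wants_bus, dma_period):
    """Return (timeline, stalls) for a bus shared by the CPU and a DMA device.

    cpu_wants_bus: one bool per cycle, True if the CPU wants the bus.
    dma_period:    the DMA device requests the bus every dma_period cycles.
    """
    timeline = []
    stalls = 0
    for cycle, cpu_wants in enumerate(cpu_wants_bus, start=1):
        dma_wants = cycle % dma_period == 0
        if dma_wants:
            timeline.append("D")      # DMA has priority on this cycle
            if cpu_wants:
                stalls += 1           # CPU loses this cycle to the DMA device
        elif cpu_wants:
            timeline.append("C")
        else:
            timeline.append("-")      # bus idle
    return "".join(timeline), stalls


# CPU idle for 4 cycles, then on the bus for 8 cycles, repeated twice.
pattern = ([False] * 4 + [True] * 8) * 2
timeline, stalls = bus_timeline(pattern, dma_period=12)
print(timeline)  # ----CCCCCCCD----CCCCCCCD
print(stalls)    # 2
```

In this run the CPU loses 2 of the 16 cycles it wanted, and it loses nothing at all whenever a DMA cycle falls on an idle slot.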
I don't understand how a bus line can be "partly blocked"
Over a period of many clock cycles, some will be blocked and some will not. Quoting the University of Melbourne:
Q2. What is cycle stealing? Why are there cycles to steal?
A2. When a DMA device transfers data to or from memory, it will (in most architectures) use the same bus as the CPU would use to access memory. If the CPU wants to use the bus at the same time as a DMA device, the CPU will stall for a cycle, since the DMA device has the higher priority. This is necessary to prevent overruns with small DMA buffers. (The CPU never suffers from overruns.)
Most modern CPUs have caches that satisfy most memory references without having to go to main memory through the bus. DMA will therefore have much less impact on them.
Even if the CPU is completely starved while a DMA block transfer is occurring, it will happen faster than if the CPU had to sit in a loop shifting bytes to/from an I/O device.
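The cache argument in the quoted answer can also be put in rough numbers. This is a back-of-the-envelope sketch that treats cache misses and DMA bus usage as independent, which is a simplification; the 95% hit rate and the 10% DMA bus share are assumed values, not measurements.

```python
def stalled_fraction(cache_hit_rate, dma_bus_share):
    """Rough fraction of CPU memory references that collide with a DMA cycle.

    Only cache misses reach the bus, and only those that coincide with a
    DMA cycle get stalled (assuming the two are independent).
    """
    miss_rate = 1.0 - cache_hit_rate
    return miss_rate * dma_bus_share


print(f"{stalled_fraction(0.95, 0.10):.2%}")  # 0.50%
```

Under those assumptions only about one memory reference in two hundred would be delayed by cycle stealing.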