It appears that if my program attempts to write to a full file system it it will initially error with 'No space left on device' nearly instantly, but if I leave it running for a minute or so it slows down by a couple orders of magnitude.
Note this is a minimum test case, this behaviour was first noticed in a Java logging framework, hence the small (512 byte) output chunks and a lot of them.
main.c
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>
#include <sys/time.h>
long microTime() {
struct timeval time;
gettimeofday(&time, NULL);
return time.tv_sec * 1000 * 1000 + time.tv_usec;
}
int main(int argc, const char * argv[]) {
char *payload ;
payload = (char *)malloc(512 * sizeof(char));
if (payload == NULL) {
printf("Failed to alloc memory\n");
exit(1);
}
FILE *fp;
fp = fopen(argv[1], "a+");
printf("opened [%s]\n", argv[1]);
if (fp == NULL) {
printf("Failed to open [%s]\n", argv[1]);
exit(1);
}
for (;;) {
int batchSize = 100000;
bool errored = false;
long runStart = microTime();
for (int i = 0; i < batchSize; i ++) {
size_t result = fwrite(payload, 512 * sizeof(char), 1, fp);
if (result == 0 && !errored) {
perror("Failed to write to disk");
errored = true;
}
}
long runEnd = microTime();
printf("total elapsed %dms\n", (int)(runEnd - runStart) / 1000);
}
return 0;
}
(please excuse my C, this is probably the first C program I've written in nearly 20 years)
Running:
gcc -std=c99 main.c && ./a.out /path/to/somewhere/file1.bin
Warning this program will fill your disk partition
output:
total elapsed 42ms
total elapsed 105ms
total elapsed 104ms
total elapsed 125ms
... skip until the disk fills
Failed to write to disk: No space left on device
total elapsed 104ms
Failed to write to disk: No space left on device
total elapsed 76ms
Failed to write to disk: No space left on device
total elapsed 84ms
Failed to write to disk: No space left on device
... then skip a little more about one minute
total elapsed 8096ms
Failed to write to disk: No space left on device
total elapsed 43245ms
Failed to write to disk: No space left on device
total elapsed 48670ms
Failed to write to disk: No space left on device
total elapsed 45929ms
Failed to write to disk: No space left on device
My expectation is that this program would run for ever with a fairly low constant time to write to disk.
I've run this on a local centos 6.4 vagrant box, an amazon linux and an ubuntu 14.04 ec2 boxes with exactly the same results. Interestingly it doesn't seem to happen on OSX 10.9.5 attempting to fill a disk image.
So my question is really, what is causing this apparent throttling?
update: Run with strace -t -T
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.000011>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.000011>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000066>
10:27:38 dup(2) = 4 <0.000006>
10:27:38 fcntl(4, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE) <0.000011>
10:27:38 fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000006>
10:27:38 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa8e97f2000 <0.000009>
10:27:38 lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek) <0.000005>
10:27:38 write(4, "Failed to write to disk: No spac"..., 49Failed to write to disk: No space left on device
) = 49 <0.000006>
10:27:38 close(4) = 0 <0.000006>
10:27:38 munmap(0x7fa8e97f2000, 4096) = 0 <0.000015>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000026>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000017>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000016>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000016>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000015>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000016>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000015>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000015>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000016>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000016>
... skipping to the end
10:30:02 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.005747>
10:30:02 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.005231>
10:30:02 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.005496>
10:30:02 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.005870>
10:30:02 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.005823>
10:30:02 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.005841>
not the call times to write()
go from ~0.00001s => ~0.005s by the end of the run.
The full run.log is 250mb
update 2: Adding CPU usage detail for the various phases:
Staring the run, ie filling the disk
top - 11:00:56 up 2:52, 2 users, load average: 1.79, 0.99, 0.48
Tasks: 75 total, 3 running, 72 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.3%us, 71.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 14.6%hi, 6.3%si, 0.0%st
Mem: 603764k total, 561868k used, 41896k free, 134976k buffers
Swap: 1254392k total, 0k used, 1254392k free, 359020k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9861 vagrant 20 0 4056 500 412 R 54.5 0.1 0:01.88 a.out
766 root 20 0 0 0 0 R 36.9 0.0 0:28.51 flush-8:0
28 root 20 0 0 0 0 S 5.6 0.0 0:05.14 kswapd0
16 root 20 0 0 0 0 S 2.3 0.0 4:51.07 kblockd/0
Disk full, fast erroring
top - 11:01:11 up 2:52, 2 users, load average: 1.68, 1.01, 0.49
Tasks: 75 total, 2 running, 73 sleeping, 0 stopped, 0 zombie
Cpu(s): 9.0%us, 91.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 603764k total, 555424k used, 48340k free, 134976k buffers
Swap: 1254392k total, 0k used, 1254392k free, 352224k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9861 vagrant 20 0 4056 552 464 R 99.5 0.1 0:12.91 a.out
988 root 20 0 215m 1572 860 S 0.3 0.3 0:05.05 VBoxService
1 root 20 0 19228 1348 1072 S 0.0 0.2 0:00.24 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
Slow erroring
top - 11:03:03 up 2:54, 2 users, load average: 1.63, 1.14, 0.59
Tasks: 74 total, 3 running, 71 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.4%sy, 0.0%ni, 0.0%id, 98.8%wa, 0.4%hi, 0.4%si, 0.0%st
Mem: 603764k total, 555284k used, 48480k free, 134976k buffers
Swap: 1254392k total, 0k used, 1254392k free, 352308k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
216 root 20 0 0 0 0 R 3.7 0.0 4:50.72 jbd2/sda1-8
16 root 20 0 0 0 0 R 3.3 0.0 4:53.04 kblockd/0
9861 vagrant 20 0 4056 552 464 D 1.3 0.1 1:17.47 a.out
1 root 20 0 19228 1348 1072 S 0.0 0.2 0:00.24 init
Technically fwrite() is a blocking call in that it does not return until the procedure has completed. However the definition of completion for fwrite() is that the data you supply has been written to an internal file buffer.
In other words, fwrite(3) is just a library routine that collects up output into chunks, and then calls write(2) . Now, write(2) , is a system call which traps into the kernel.
You may be observing the effects of Linux virtual memory when write
system calls block and become synchronous. This is because your application continuously generates data to be written to disk and the disk cannot store the data as fast.
Alternatively, the background flush out kicks in sometime after the disk was filled, so that write
syscalls compete for a access with the kernel flush out thread to the filesystem internal structures, so that it takes longer to return ENOSPC
.
See Better Linux Disk Caching & Performance with vm.dirty_ratio & vm.dirty_background_ratio:
vm.dirty_background_ratio is the percentage of system memory that can be filled with “dirty” pages — memory pages that still need to be written to disk — before the pdflush/flush/kdmflush background processes kick in to write it to disk. My example is 10%, so if my virtual server has 32 GB of memory that’s 3.2 GB of data that can be sitting in RAM before something is done.
vm.dirty_ratio is the absolute maximum amount of system memory that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point all new I/O blocks until dirty pages have been written to disk. This is often the source of long I/O pauses, but is a safeguard against too much data being cached unsafely in memory.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With