
What throttles the fwrite() calls to a full disk on linux?

Tags:

c

linux

It appears that if my program attempts to write to a full file system, it will initially error with 'No space left on device' almost instantly, but if I leave it running for a minute or so the failing writes slow down by a couple of orders of magnitude.

Note: this is a minimal test case. The behaviour was first noticed in a Java logging framework, hence the small (512-byte) output chunks and the large number of them.

main.c

#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>
#include <sys/time.h>


long microTime() {
    struct timeval time;
    gettimeofday(&time, NULL);

    return time.tv_sec * 1000 * 1000 + time.tv_usec;
}

int main(int argc, const char * argv[]) {

    /* calloc so the payload bytes are defined (zeros), matching the strace output */
    char *payload = calloc(512, sizeof(char));

    if (payload == NULL) {
        printf("Failed to alloc memory\n");
        exit(1);
    }

    FILE *fp;
    fp = fopen(argv[1], "a+");

    printf("opened [%s]\n", argv[1]);

    if (fp == NULL) {
        printf("Failed to open [%s]\n", argv[1]);
        exit(1);
    }

    for (;;) {
        int batchSize = 100000;
        bool errored = false;
        long runStart = microTime();
        for (int i = 0; i < batchSize; i ++) {
            size_t result = fwrite(payload, 512 * sizeof(char), 1, fp);
            if (result == 0 && !errored) {
                perror("Failed to write to disk");
                errored = true;
            }

        }
        long runEnd = microTime();
        printf("total elapsed %dms\n", (int)((runEnd - runStart) / 1000));
    }

    return 0;
}

(please excuse my C, this is probably the first C program I've written in nearly 20 years)

Running:

gcc -std=c99 main.c && ./a.out /path/to/somewhere/file1.bin

Warning: this program will fill your disk partition.

output:

total elapsed 42ms
total elapsed 105ms
total elapsed 104ms
total elapsed 125ms
... skip until the disk fills
Failed to write to disk: No space left on device
total elapsed 104ms
Failed to write to disk: No space left on device
total elapsed 76ms
Failed to write to disk: No space left on device
total elapsed 84ms
Failed to write to disk: No space left on device
... then skip a little more about one minute
total elapsed 8096ms
Failed to write to disk: No space left on device
total elapsed 43245ms
Failed to write to disk: No space left on device
total elapsed 48670ms
Failed to write to disk: No space left on device
total elapsed 45929ms
Failed to write to disk: No space left on device

My expectation was that this program would run forever with a fairly low, constant time per write.

I've run this on a local CentOS 6.4 Vagrant box, an Amazon Linux EC2 instance and an Ubuntu 14.04 EC2 instance, with exactly the same results. Interestingly, it doesn't seem to happen on OS X 10.9.5 when filling a disk image.

So my question is really, what is causing this apparent throttling?

Update: run with strace -t -T:

10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.000011>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.000011>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000066>
10:27:38 dup(2)                         = 4 <0.000006>
10:27:38 fcntl(4, F_GETFL)              = 0x8002 (flags O_RDWR|O_LARGEFILE) <0.000011>
10:27:38 fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000006>
10:27:38 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa8e97f2000 <0.000009>
10:27:38 lseek(4, 0, SEEK_CUR)          = -1 ESPIPE (Illegal seek) <0.000005>
10:27:38 write(4, "Failed to write to disk: No spac"..., 49Failed to write to disk: No space left on device
) = 49 <0.000006>
10:27:38 close(4)                       = 0 <0.000006>
10:27:38 munmap(0x7fa8e97f2000, 4096)   = 0 <0.000015>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000026>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000017>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000016>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000016>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000015>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000016>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000015>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000015>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000016>
10:27:38 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.000016>

... skipping to the end
10:30:02 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.005747>
10:30:02 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.005231>
10:30:02 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.005496>
10:30:02 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.005870>
10:30:02 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.005823>
10:30:02 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = -1 ENOSPC (No space left on device) <0.005841>

Note the call times for write() go from ~0.00001s to ~0.005s by the end of the run.

The full run.log is 250 MB.

Update 2: CPU usage detail for the various phases:

Starting the run, i.e. filling the disk

top - 11:00:56 up  2:52,  2 users,  load average: 1.79, 0.99, 0.48
Tasks:  75 total,   3 running,  72 sleeping,   0 stopped,   0 zombie
Cpu(s):  7.3%us, 71.8%sy,  0.0%ni,  0.0%id,  0.0%wa, 14.6%hi,  6.3%si,  0.0%st
Mem:    603764k total,   561868k used,    41896k free,   134976k buffers
Swap:  1254392k total,        0k used,  1254392k free,   359020k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9861 vagrant   20   0  4056  500  412 R 54.5  0.1   0:01.88 a.out
  766 root      20   0     0    0    0 R 36.9  0.0   0:28.51 flush-8:0
   28 root      20   0     0    0    0 S  5.6  0.0   0:05.14 kswapd0
   16 root      20   0     0    0    0 S  2.3  0.0   4:51.07 kblockd/0

Disk full, fast erroring

top - 11:01:11 up  2:52,  2 users,  load average: 1.68, 1.01, 0.49
Tasks:  75 total,   2 running,  73 sleeping,   0 stopped,   0 zombie
Cpu(s):  9.0%us, 91.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    603764k total,   555424k used,    48340k free,   134976k buffers
Swap:  1254392k total,        0k used,  1254392k free,   352224k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9861 vagrant   20   0  4056  552  464 R 99.5  0.1   0:12.91 a.out
  988 root      20   0  215m 1572  860 S  0.3  0.3   0:05.05 VBoxService
    1 root      20   0 19228 1348 1072 S  0.0  0.2   0:00.24 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd

Slow erroring

top - 11:03:03 up  2:54,  2 users,  load average: 1.63, 1.14, 0.59
Tasks:  74 total,   3 running,  71 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.4%sy,  0.0%ni,  0.0%id, 98.8%wa,  0.4%hi,  0.4%si,  0.0%st
Mem:    603764k total,   555284k used,    48480k free,   134976k buffers
Swap:  1254392k total,        0k used,  1254392k free,   352308k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  216 root      20   0     0    0    0 R  3.7  0.0   4:50.72 jbd2/sda1-8
   16 root      20   0     0    0    0 R  3.3  0.0   4:53.04 kblockd/0
 9861 vagrant   20   0  4056  552  464 D  1.3  0.1   1:17.47 a.out
    1 root      20   0 19228 1348 1072 S  0.0  0.2   0:00.24 init
Gareth Davis, asked Jan 19 '15


People also ask

Is fwrite a blocking call?

Technically fwrite() is a blocking call in that it does not return until the procedure has completed. However the definition of completion for fwrite() is that the data you supply has been written to an internal file buffer.

Is fwrite a system call?

In other words, fwrite(3) is just a library routine that collects output into chunks and then calls write(2). write(2), in turn, is a system call that traps into the kernel.


1 Answer

You may be observing the effects of Linux virtual memory: write system calls block and become synchronous when your application continuously generates data faster than the disk can store it.

Alternatively, the background flush kicks in some time after the disk fills, so that write syscalls compete with the kernel flusher thread for access to the filesystem's internal structures, and each call takes longer to return ENOSPC.

See Better Linux Disk Caching & Performance with vm.dirty_ratio & vm.dirty_background_ratio:

vm.dirty_background_ratio is the percentage of system memory that can be filled with “dirty” pages — memory pages that still need to be written to disk — before the pdflush/flush/kdmflush background processes kick in to write it to disk. My example is 10%, so if my virtual server has 32 GB of memory that’s 3.2 GB of data that can be sitting in RAM before something is done.

vm.dirty_ratio is the absolute maximum amount of system memory that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point all new I/O blocks until dirty pages have been written to disk. This is often the source of long I/O pauses, but is a safeguard against too much data being cached unsafely in memory.

Maxim Egorushkin, answered Oct 13 '22