 

PHP: Writing a lot of small files the fastest or/and most efficient way

Tags: php, io

Imagine a campaign in which 10,000 to 30,000 files, about 4 KB each, must be written to disk.

And there will be a couple of campaigns running at the same time, 10 at most.

Currently, I'm going with the usual way: file_put_contents.

It gets the job done, but slowly, and the PHP process sits at 100% CPU the whole time.

With fopen, fwrite and fclose the result is similar to file_put_contents.

I've tried some async I/O options such as PHP eio and Swoole.

They're faster, but after a while they fail with "too many open files".

php -r 'echo exec("ulimit -n");' reports 800000.

Any help would be appreciated!


Update: well, this is sort of embarrassing... you guys are correct, the bottleneck is how the file content is generated...

Asked by Jesse, Sep 11 '16.


1 Answer

I am assuming that you cannot follow SomeDude's very good advice to use a database instead, and that you have already performed whatever hardware tuning could be done (e.g. increasing cache, increasing RAM to avoid swap thrashing, purchasing SSD drives).

I'd try and offload the file generation to a different process.

You could, e.g., install Redis and store the file content in its key-value store, which is very fast. A different, parallel process could then read the data from the store, delete the key, and write the content to a file on disk.

This removes all disk I/O from the main PHP process, lets you monitor the backlog (how many keys are still unflushed: ideally zero), and lets you concentrate on the bottleneck in content generation. You'll possibly need some extra RAM.
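A minimal sketch of that split, assuming the phpredis extension and a Redis server on localhost (the queue name, target path and helper function are illustrative, and the two halves would live in separate scripts):

<?php
// --- Producer side (inside the main PHP process) ---------------------
// Instead of touching the disk, push each generated file onto a Redis list.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

function queue_file(Redis $redis, string $name, string $data): void
{
    $redis->lPush('file_queue', json_encode(['name' => $name, 'data' => $data]));
}

// --- Worker side (a separate, parallel PHP process) ------------------
// Pops jobs and performs the actual disk I/O; the target directory is
// just a placeholder.
$worker = new Redis();
$worker->connect('127.0.0.1', 6379);
while (true) {
    $job = $worker->brPop(['file_queue'], 5);   // block for up to 5 seconds
    if (empty($job)) {
        continue;                               // queue empty, keep polling
    }
    $item = json_decode($job[1], true);
    file_put_contents("/var/spool/campaign/{$item['name']}", $item['data']);
}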

On the other hand, this is not too different from writing to a RAM disk. You could also output the data to a RAM disk, and it would probably be even faster:

# As root
mkdir /mnt/ramdisk
mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
mkdir /mnt/ramdisk/temp 
mkdir /mnt/ramdisk/ready
# Change ownership and permissions as appropriate

and in PHP:

$fp = fopen("/mnt/ramdisk/temp/{$file}", "w");
fwrite($fp, $data);
fclose($fp);
rename("/mnt/ramdisk/temp/{$file}", "/mnt/ramdisk/ready/{$file}");

and then have a different process (a crontab job? or a continuously running daemon?) move files from the "ready" directory of the RAM disk to the real disk, deleting the RAM copy afterwards.
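A very simple mover could look like the sketch below (the destination path is a placeholder; copy() plus unlink() is used because the source and target sit on different filesystems):

<?php
// Mover process: drain /mnt/ramdisk/ready onto the real disk.
// Run it from cron, or wrap it in a sleep loop as a small daemon.
$src = '/mnt/ramdisk/ready';
$dst = '/var/www/campaign-files';   // placeholder destination

foreach (scandir($src) as $file) {
    if ($file === '.' || $file === '..') {
        continue;
    }
    // copy + unlink rather than rename(), because the source (tmpfs)
    // and the target (disk) live on different filesystems.
    if (copy("{$src}/{$file}", "{$dst}/{$file}")) {
        unlink("{$src}/{$file}");
    }
}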

File System

The time required to create a file depends on the number of files already in the directory, and the exact relationship depends on the file system: ext4, ext3, zfs, btrfs etc. will exhibit different behaviour. In particular, you might see significant slowdowns once the number of files exceeds some threshold.

So you might want to time the creation of a large number of sample files in one directory and see how that time grows with the file count. Keep in mind that there is also a performance penalty for accessing different directories, so jumping straight to a very large number of subdirectories is not recommended either.

<?php
    $payload    = str_repeat("Squeamish ossifrage. \n", 253);   // ~5.5 KB of dummy data

    // Time the creation of 10,000 small files in the current directory.
    $time       = microtime(true);
    for ($i = 0; $i < 10000; $i++) {
        $fp = fopen("file-{$i}.txt", "w");
        fwrite($fp, $payload);
        fclose($fp);
    }
    $time = microtime(true) - $time;

    // Clean up the sample files (the cleanup is not part of the timing).
    for ($i = 0; $i < 10000; $i++) {
        unlink("file-{$i}.txt");
    }
    print "Elapsed time: {$time} s\n";

Creation of 10000 files takes 0.42 seconds on my system, but creation of 100000 files (10x) takes 5.9 seconds, not 4.2. On the other hand, creating one eighth of those files in 8 separate directories (the best compromise I found) takes 6.1 seconds, so it's not worthwhile.

But suppose that creating 300000 files took 25 seconds instead of 17.7; dividing those files in ten directories might take 22 seconds, and make the directory split worthwhile.
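If the split does pay off on your file system, spreading the files over a fixed number of subdirectories is a small change. Here is a minimal sketch, where the bucket count of 10 and the directory names are just examples:

<?php
$payload = str_repeat("Squeamish ossifrage. \n", 253);
$buckets = 10;                         // example value; tune for your file system

// Create the bucket directories once, up front.
for ($b = 0; $b < $buckets; $b++) {
    @mkdir("bucket-{$b}", 0755);
}

$time = microtime(true);
for ($i = 0; $i < 300000; $i++) {
    // Simple modulo bucketing keeps each directory at ~1/10 of the files.
    file_put_contents('bucket-' . ($i % $buckets) . "/file-{$i}.txt", $payload);
}
print "Elapsed time: " . (microtime(true) - $time) . " s\n";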

Parallel processing: r strategy

TL;DR this doesn't work so well on my system, though your mileage may vary. If the operations to be performed are lengthy (here they are not) and bound differently from the main process, it can be advantageous to offload each of them to a child process, provided you don't spawn too many of them.

You will need pcntl functions installed.

$payload    = str_repeat("Squeamish ossifrage. \n", 253);

$time       = microtime(true);
for ($i = 0; $i < 100000; $i++) {
    $pid = pcntl_fork();
    switch ($pid) {
        case 0:
            // Child: parallel execution - write one file, then terminate.
            $fp = fopen("file-{$i}.txt", "w");
            fwrite($fp, $payload);
            fclose($fp);
            exit();
        case -1:
            echo 'Could not fork process.';
            exit();
        default:
            // Parent: keep looping and forking. Children are never reaped
            // here, so they linger as zombies until the parent exits.
            break;
    }
}
$time = microtime(true) - $time;
print "Elapsed time: {$time} s\n";

(The fancy name r strategy is taken from biology).

In this example, the spawning time is catastrophic compared to what each child actually needs to do, so the overall processing time skyrockets. With more complex children things would go better, but you must be careful not to turn the script into a fork bomb.

One possibility could be to divide the files to be created into, say, chunks of 10% each. Each child would then change its working directory with chdir() and create its files in a different directory. This would negate the penalty for writing files into different subdirectories (each child writes in its own current directory) while benefiting from each directory holding fewer files. In this case, with very lightweight and I/O-bound operations in the child, the strategy again isn't worthwhile (I get double the execution time), but a rough sketch follows anyway.
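For what it's worth, a sketch of that chunked variant, with each child handling one tenth of the files in its own directory (the chunk count and directory names are illustrative):

<?php
$payload  = str_repeat("Squeamish ossifrage. \n", 253);
$total    = 100000;
$children = 10;                        // one chunk (10%) per child
$perChild = (int)($total / $children);

for ($c = 0; $c < $children; $c++) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die('Could not fork process.');
    }
    if ($pid === 0) {
        // Child: work in its own directory, so siblings never touch
        // the same directory's metadata.
        $dir = "chunk-{$c}";
        @mkdir($dir, 0755);
        chdir($dir);
        for ($i = 0; $i < $perChild; $i++) {
            file_put_contents("file-{$i}.txt", $payload);
        }
        exit(0);
    }
}

// Parent: reap all children so none linger as zombies.
while (pcntl_wait($status) > 0) {
    // nothing else to do
}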

Parallel processing: K strategy

TL;DR this is more complex but works well... on my system. Your mileage may vary. While the r strategy involves lots of fire-and-forget children, the K strategy calls for a limited number of children (possibly just one), each nurtured carefully. Here we offload the creation of all the files to one parallel process and communicate with it via sockets.

$payload    = str_repeat("Squeamish ossifrage. \n", 253);

$sockets = array();
$domain = (strtoupper(substr(PHP_OS, 0, 3)) == 'WIN' ? AF_INET : AF_UNIX);
if (socket_create_pair($domain, SOCK_STREAM, 0, $sockets) === false) {
   echo "socket_create_pair failed. Reason: ".socket_strerror(socket_last_error());
}
$pid = pcntl_fork();
if ($pid == -1) {
    echo 'Could not fork Process.';
} elseif ($pid) {
    /*parent*/
    socket_close($sockets[0]);
} else {
    /*child*/
    socket_close($sockets[1]);
    for (;;) {
        $cmd = socket_read($sockets[0], 5, PHP_BINARY_READ);
        if (false === $cmd) {
            // Check for a read error before trimming; trim() would otherwise
            // turn false into an empty string and mask the failure.
            die("ERROR\n");
        }
        $cmd = trim($cmd);
        if ('QUIT' === $cmd) {
            socket_write($sockets[0], "OK", 2);
            socket_close($sockets[0]);
            exit(0);
        }
        if ('FILE' === $cmd) {
            $file   = trim(socket_read($sockets[0], 20, PHP_BINARY_READ));
            $len    = trim(socket_read($sockets[0], 8, PHP_BINARY_READ));
            $data   = socket_read($sockets[0], $len, PHP_BINARY_READ);
            $fp     = fopen($file, "w");
            fwrite($fp, $data);
            fclose($fp);
            continue;
        }
        die("UNKNOWN COMMAND: {$cmd}");
    }
}

$time       = microtime(true);
for ($i = 0; $i < 100000; $i++) {
    socket_write($sockets[1], sprintf("FILE %20.20s%08.08s", "file-{$i}.txt", strlen($payload)));
    socket_write($sockets[1], $payload, strlen($payload));
    //$fp = fopen("file-{$i}.txt", "w");
    //fwrite($fp, $payload);
    //fclose($fp);
}
$time = microtime(true) - $time;
print "Elapsed time: {$time} s\n";

socket_write($sockets[1], "QUIT\n", 5);
$ok = socket_read($sockets[1], 2, PHP_BINARY_READ);
socket_close($sockets[1]);

THIS IS HUGELY DEPENDENT ON THE SYSTEM CONFIGURATION. For example, on a mono-processor, mono-core, non-threading CPU this is madness - you'll at least double the total runtime, and more likely it will run three to ten times slower.

So this is definitely not the way to pimp up something running on an old system.

On a modern multithreading CPU and supposing the main content creation loop is CPU bound, you may experience the reverse - the script might go ten times faster.

On my system, the "forking" solution above runs a bit less than three times faster. I expected more, but there you are.

Of course, whether the performance gain is worth the added complexity and maintenance remains to be evaluated.

The bad news

While experimenting with the above, I came to the conclusion that file creation on a reasonably configured, performant Linux machine is fast as hell, so not only is it difficult to squeeze out more performance, but if you're experiencing slowness, it's very likely not file related. Try giving some more detail about how you generate that content.

Answered by LSerni.