Reducing kernel overhead when reading a huge file with lazy bytestrings

Tags: linux, io, haskell

I'm reading a big file (1-10 GB) and computing some simple statistic over it, like counting a character. Streaming makes sense in this context, so I'm using lazy ByteStrings. In particular, my main looks like

import qualified Data.ByteString.Lazy as BSL

main :: IO ()
main = do
  contents <- BSL.readFile "path"
  print $ computeStats contents

The details of computeStats are probably unimportant in this context.
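
For concreteness, though, think of it as something as simple as a byte count (this is just an illustrative stand-in, not my real code):

import qualified Data.ByteString.Lazy as BSL

-- Illustrative stand-in for the real statistic: count newline bytes.
computeStats :: BSL.ByteString -> Int
computeStats = fromIntegral . BSL.count 10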

When running this with +RTS -sstderr, I'm seeing this:

MUT     time    0.938s  (  1.303s elapsed)

Note the difference between CPU time and elapsed time. In addition to that, running under /usr/bin/time shows similar results:

0.89user 0.45system 0:01.35elapsed

The file I'm testing with is in tmpfs, so actual disk performance shouldn't be a factor.

How can I reduce the system time in this case? I tried explicitly setting the buffer size for the file handle (no statistically significant effect on the running times) as well as mmapping the file and wrapping it in a ByteString (running times actually got worse). What else might be worth trying?
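
For reference, the buffer-size experiment looked roughly like this (the 1 MiB size is just an illustrative value):

import System.IO
import qualified Data.ByteString.Lazy as BSL

main :: IO ()
main = do
  h <- openFile "path" ReadMode
  -- Explicitly sized handle buffer; 1 MiB here is illustrative.
  hSetBuffering h (BlockBuffering (Just (1024 * 1024)))
  contents <- BSL.hGetContents h
  print $ computeStats contents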

asked Jan 31 '20 by 0xd34df00d

1 Answer

First, it seems like something wonky is going on with your machine. When I run this program on a 1G file cached in memory or in a tmpfs filesystem (it doesn't matter which), the system time is substantially smaller:

1.44user 0.14system 0:01.60elapsed 99%CPU (0avgtext+0avgdata 50256maxresident)

If some other load or memory pressure on your machine accounts for those extra 300ms, you'll need to resolve that first before anything I say below will help, but...

Anyway, for my tests I've used a larger 5G test file to make the system time easier to quantify; a sketch of one way to generate such a file follows.
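
In case it's useful, here's one way such a file could be generated (a sketch; the repeating byte pattern and the path are illustrative, and the pattern makes exactly 1/256 of the bytes null):

import qualified Data.ByteString.Lazy as BSL

-- Hypothetical generator: 5G of a repeating 0..255 byte pattern.
main :: IO ()
main = BSL.writeFile "/dev/shm/testfile5G.dat" $
  BSL.take (5 * 1024 ^ 3) (BSL.cycle (BSL.pack [0 .. 255]))

As a baseline, the C program: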

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

#define BUFLEN (1024*1024)
char buffer[BUFLEN];

int
main()
{
        long nulls = 0;
        ssize_t n;
        int fd = open("/dev/shm/testfile5G.dat", O_RDONLY);
        /* Scan only the bytes each read() actually returned; a long
           counter avoids overflow if most of a 5G file is null. */
        while ((n = read(fd, buffer, BUFLEN)) > 0) {
                for (ssize_t i = 0; i < n; ++i) {
                        if (!buffer[i]) ++nulls;
                }
        }
        printf("%ld\n", nulls);
}

compiled with gcc -O2 runs on my testfile with times:

real    0m2.035s
user    0m1.619s
sys     0m0.416s

For comparison, the Haskell program compiled with ghc -O2:

import Data.Word
import qualified Data.ByteString.Lazy as BSL

main :: IO ()
main = do
  contents <- BSL.readFile "/scratch/buhr/testfile5G.dat"
  print $ BSL.foldl' go 0 contents
    where go :: Int -> Word8 -> Int
          go n 0 = n + 1
          go n _ = n

is quite a bit slower at the count but has nearly equivalent system time:

real    0m8.411s
user    0m7.966s
sys     0m0.444s

Simpler tests, like cat testfile5G.dat >/dev/null, give consistent system times, so it's safe to conclude that the overhead of the read calls, most likely the copying of data from kernel to user address space, accounts for an appreciable fraction of the roughly 410ms of system time.

Contrary to your experience above, switching to mmap should reduce this overhead. The Haskell program:

import System.Posix.IO
import Foreign.Ptr
import Foreign.ForeignPtr
import MMAP
import qualified Data.ByteString as BS
import qualified Data.ByteString.Internal as BS
import Data.Word (Word8)

-- exact length of file
len :: Integral a => a
len = 5368709120

main :: IO ()
main = do
  fd <- openFd "/scratch/buhr/testfile5G.dat" ReadOnly Nothing defaultFileFlags
  ptr <- newForeignPtr_ =<< castPtr <$>
    mmap nullPtr len protRead (mkMmapFlags mapPrivate mempty) fd 0
  let contents = BS.fromForeignPtr ptr 0 len
  print $ BS.foldl' go 0 contents
    where go :: Int -> Word8 -> Int
          go n 0 = n + 1
          go n _ = n

runs with about the same user time but substantially reduced system time:

real    0m7.972s
user    0m7.791s
sys     0m0.181s

Note that it's absolutely critical to use a zero-copy method of turning the mmapped area into a strict ByteString here.
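
For contrast, a copying conversion such as BS.packCStringLen would memcpy the entire mapping into a freshly allocated buffer, reintroducing exactly the copying cost we just avoided (a sketch with a hypothetical helper):

import Data.Word (Word8)
import Foreign.Ptr (Ptr, castPtr)
import qualified Data.ByteString as BS

-- Hypothetical counterexample: packCStringLen COPIES all len bytes out
-- of the mapping into a new buffer, so the savings from mmap are lost.
copyingConvert :: Ptr Word8 -> Int -> IO BS.ByteString
copyingConvert p len = BS.packCStringLen (castPtr p, len)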

At this point, I think we're probably down to the overhead of managing process page tables, and a C version using mmap:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

size_t len = 5368709120;

int
main()
{
        long nulls = 0;
        int fd = open("/scratch/buhr/testfile5G.dat", O_RDONLY);
        char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        /* The index must be size_t: an int would overflow before 5G. */
        for (size_t i = 0; i < len; ++i) {
                if (!p[i]) ++nulls;
        }
        printf("%ld\n", nulls);
}

has similar system time:

real    0m1.888s
user    0m1.708s
sys     0m0.180s
answered Sep 28 '22 by K. A. Buhr