I'm reading a big file (1-10 GB) and computing some simple statistic over it, like counting occurrences of a character. Streaming makes sense in this context, so I'm using lazy `ByteString`s. In particular, my `main` looks like:

```haskell
import qualified Data.ByteString.Lazy as BSL

main :: IO ()
main = do
  contents <- BSL.readFile "path"
  print $ computeStats contents
```

The details of `computeStats` are probably unimportant in this context.
When running this with `+RTS -sstderr`, I'm seeing this:

```
MUT     time    0.938s  (  1.303s elapsed)
```

Note the difference between CPU time and elapsed time. In addition to that, running under `/usr/bin/time` shows similar results:

```
0.89user 0.45system 0:01.35elapsed
```
The file I'm testing with is in `tmpfs`, so actual disk performance shouldn't be a factor.

How can I reduce the system time in this case? I tried explicitly setting the buffer size for the file handle (no statistically significant effect on the running times), as well as `mmap`ing the file and wrapping it in a `ByteString` (running times actually got worse). What else could be worth trying?
First, it seems like something wonky is going on with your machine. When I run this program on a 1G file cached in memory or in a tmpfs filesystem (it doesn't matter which), the system time is substantially smaller:

```
1.44user 0.14system 0:01.60elapsed 99%CPU (0avgtext+0avgdata 50256maxresident)
```

If you have some other load or memory pressure that might account for these extra 300ms, you'll need to resolve that first before anything I say below will help, but...
Anyway, for my tests I've used a larger 5G test file to make the system time easier to quantify. As a baseline, the C program:

```c
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

#define BUFLEN (1024*1024)

char buffer[BUFLEN];

int
main()
{
    int nulls = 0;
    int fd = open("/dev/shm/testfile5G.dat", O_RDONLY);
    ssize_t n;
    while ((n = read(fd, buffer, BUFLEN)) > 0) {
        /* only scan the bytes actually read, in case of a short read */
        for (ssize_t i = 0; i < n; ++i) {
            if (!buffer[i]) ++nulls;
        }
    }
    printf("%d\n", nulls);
}
```

compiled with `gcc -O2` runs on my test file with times:

```
real    0m2.035s
user    0m1.619s
sys     0m0.416s
```
For comparison, the Haskell program compiled with `ghc -O2`:

```haskell
import Data.Word
import qualified Data.ByteString.Lazy as BSL

main :: IO ()
main = do
  contents <- BSL.readFile "/scratch/buhr/testfile5G.dat"
  print $ BSL.foldl' go 0 contents
  where
    go :: Int -> Word8 -> Int
    go n 0 = n + 1
    go n _ = n
```

is quite a bit slower at the count but has nearly equivalent system time:

```
real    0m8.411s
user    0m7.966s
sys     0m0.444s
```
Simpler tests like `cat testfile5G.dat >/dev/null` give consistent system time results, so it's safe to conclude that the overhead of the `read` calls, most likely the cost of copying data from kernel to user address space, accounts for an appreciable fraction of the roughly 410ms of system time.
Contrary to your experience above, switching to `mmap` should reduce this overhead. The Haskell program:

```haskell
import System.Posix.IO
import Foreign.Ptr
import Foreign.ForeignPtr
import MMAP
import qualified Data.ByteString as BS
import qualified Data.ByteString.Internal as BS

-- exact length of file
len :: Integral a => a
len = 5368709120

main :: IO ()
main = do
  fd <- openFd "/scratch/buhr/testfile5G.dat" ReadOnly Nothing defaultFileFlags
  ptr <- newForeignPtr_ =<< castPtr <$>
    mmap nullPtr len protRead (mkMmapFlags mapPrivate mempty) fd 0
  let contents = BS.fromForeignPtr ptr 0 len
  print $ BS.foldl' (+) 0 contents
```

runs with about the same user time but substantially reduced system time:

```
real    0m7.972s
user    0m7.791s
sys     0m0.181s
```

Note that it's absolutely critical to use a zero-copy method of turning the mmapped area into a strict `ByteString` here.
At this point, I think we're probably down to the overhead of managing process page tables, and a C version using `mmap`:

```c
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

size_t len = 5368709120;

int
main()
{
    int nulls = 0;
    int fd = open("/scratch/buhr/testfile5G.dat", O_RDONLY);
    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    /* size_t, not int: len exceeds INT_MAX, so an int index would overflow */
    for (size_t i = 0; i < len; ++i) {
        if (!p[i]) ++nulls;
    }
    printf("%d\n", nulls);
}
```

has similar system time:

```
real    0m1.888s
user    0m1.708s
sys     0m0.180s
```