I had to replace all the null bytes in a file with another character (I arbitrarily chose @), and was pretty surprised that tr '\00' '@' was about 1/4 the speed of gzip:
$ pv < lawl | gzip > /dev/null
^C13MiB 0:00:04 [28.5MiB/s] [====> ] 17% ETA 0:00:18
$ pv < lawl | tr '\00' '@' > /dev/null
^C58MiB 0:00:08 [7.28MiB/s] [==> ] 9% ETA 0:01:20
My real data file is 3GB gzipped and took 50 minutes to tr, and I'll actually need to do this on many such files, so it's not a completely academic problem. Note that reading from disk (a reasonably fast SSD here), or pv, isn't the bottleneck in either case; both gzip and tr are using 100% CPU, and cat is much faster:
$ pv < lawl | cat > /dev/null
642MiB 0:00:00 [1.01GiB/s] [================================>] 100%
This code:
#include <stdio.h>

int main() {
    int ch;
    while ((ch = getchar()) != EOF) {
        if (ch == '\00') {
            putchar('@');
        } else {
            putchar(ch);
        }
    }
}
compiled with clang -O3 is somewhat faster:
$ pv < lawl | ./stupidtr > /dev/null
^C52MiB 0:00:06 [ 8.5MiB/s] [=> ] 8% ETA 0:01:0
Compiling with gcc -O4 -mtune=native -march=native (4.8.4) is comparable, maybe very slightly faster. Adding -march=native to clang (Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)) produces an identical binary.
This is presumably just because the generic processing code for replacements in tr is replaced with constants, and the checks can be compiled down. The LLVM IR (clang -S -emit-llvm -O3 stupidtr.c) looks pretty good.
I guess gzip must be faster because it's using SIMD instructions or something. Is it possible to get this up to gzip speeds?
Some specifications, if they're relevant:
- The file is a CSV; the null byte can only occur in a certain field, but some of the other fields are variable-length, so you can't just seek around arbitrarily. Most lines have a null byte in that field. I suppose this means you could do a Boyer-Moore search for ,\00, if that'd help (see the sketch after this list). Once you've found a null byte, it's also guaranteed that there can't be another one for a hundred bytes or so.
- A typical file is about 20 GiB uncompressed, but the files are bz2-compressed on disk, if that's relevant.
- You can parallelize if you want, though gzip manages this with a single core, so it shouldn't be necessary. I'll be running this either on a quad-core i7 running OS X or a two-vCPU cloud server running Linux.
- Both machines I might run on have 16GB of RAM.
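To make the search-and-skip idea concrete, here's a rough sketch of the direction I have in mind (untested on the real data; the block size and the use of stdio are arbitrary choices): read large blocks and let memchr, which is typically SIMD-optimized in libc, find each null, so almost no time is spent on the bytes in between.

#include <stdio.h>
#include <string.h>

#define BLOCK (1 << 20)        /* 1 MiB blocks; an arbitrary size */
#define REPLACE_CHAR '@'

int main(void) {
    static char buf[BLOCK];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
        char *p = buf, *end = buf + n;
        /* memchr scans many bytes per call; after each hit the scan simply
         * resumes past the replaced byte, which is effectively the "skip
         * ahead", since nulls are at least ~100 bytes apart. */
        while ((p = memchr(p, '\0', (size_t)(end - p))) != NULL) {
            *p++ = REPLACE_CHAR;
        }
        if (fwrite(buf, 1, n, stdout) != n) {
            perror("fwrite");
            return 1;
        }
    }
    return ferror(stdin) ? 1 : 0;
}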
Combining ideas from the various answers with some extra bithacks, here is an optimized version:
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFFER_SIZE 16384
#define REPLACE_CHAR '@'

int main(void) {
    /* define buffer as uint64_t to force alignment */
    /* make it one slot longer to allow for loop guard */
    uint64_t buffer[BUFFER_SIZE/8 + 1];
    ssize_t size, chunk;
    uint64_t *p, *p_end;
    uint64_t rep8 = (uint8_t)REPLACE_CHAR * 0x0101010101010101ULL;

    while ((size = read(0, buffer, BUFFER_SIZE)) != 0) {
        if (size < 0) {
            if (errno == EINTR) continue;
            fprintf(stderr, "read error: %s\n", strerror(errno));
            return 1;
        }
        p = buffer;
        p_end = p + ((size + 7) >> 3);
        *p_end = 0ULL; /* force a 0 at the end so the scan below terminates */
        for (;; p++) {
#define LOWBITS 0x0101010101010101ULL
#define HIGHBITS 0x8080808080808080ULL
            /* m gets 0x80 in every byte of *p that is 0x00 (a 0x01 byte next
             * to a zero byte in the same word can also be flagged, which is
             * harmless here: the nulls sit between commas) */
            uint64_t m = ((*p - LOWBITS) & ~*p & HIGHBITS);
            if (m != 0) {
                if (p >= p_end) break;
                m |= m >> 1; /* spread each marker bit to fill its byte */
                m |= m >> 2;
                m |= m >> 4;
                *p |= m & rep8; /* zero bytes become REPLACE_CHAR */
            }
        }
        for (unsigned char *pc = (unsigned char *)buffer;
             (chunk = write(1, pc, (size_t)size)) != size;
             pc += chunk, size -= chunk) {
            if (chunk < 0) {
                if (errno == EINTR) { chunk = 0; continue; } /* retry without advancing */
                fprintf(stderr, "write error: %s\n", strerror(errno));
                return 2;
            }
        }
    }
    return 0;
}
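For anyone puzzled by the bit twiddling: the mask computation is the classic zero-byte-detection trick from Bit Twiddling Hacks, and the final OR works because the byte being patched is known to be zero. Here is a tiny standalone check of the same logic on a single word (the sample bytes are made up):

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LOWBITS  0x0101010101010101ULL
#define HIGHBITS 0x8080808080808080ULL

/* Return w with every 0x00 byte replaced by rep; OR-ing the replacement in
 * is enough because the target byte is zero. */
static uint64_t replace_zero_bytes(uint64_t w, unsigned char rep) {
    uint64_t m = (w - LOWBITS) & ~w & HIGHBITS; /* 0x80 marks each zero byte */
    m |= m >> 1;                                /* spread the mark...        */
    m |= m >> 2;
    m |= m >> 4;                                /* ...until the byte is 0xFF */
    return w | (m & (rep * LOWBITS));
}

int main(void) {
    unsigned char bytes[8] = { 'a', 0, 'b', 0, 'c', 'd', 0, 'e' };
    uint64_t w;
    memcpy(&w, bytes, sizeof w);
    w = replace_zero_bytes(w, '@');
    memcpy(bytes, &w, sizeof w);
    assert(memcmp(bytes, "a@b@cd@e", 8) == 0);
    return 0;
}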