I had to replace all the null bytes in a file with another character (I arbitrarily chose @), and was pretty surprised that tr '\00' '@' was about 1/4 the speed of gzip:
$ pv < lawl | gzip > /dev/null
^C13MiB 0:00:04 [28.5MiB/s] [====> ] 17% ETA 0:00:18
$ pv < lawl | tr '\00' '@' > /dev/null
^C58MiB 0:00:08 [7.28MiB/s] [==> ] 9% ETA 0:01:20
My real data file is 3GB gzipped and took 50 minutes to tr, and I'll actually need to do this on many such files, so it's not a completely academic problem. Note that reading from disk (a reasonably fast SSD here), or pv, isn't the bottleneck in either case; both gzip and tr are using 100% CPU, and cat is much faster:
$ pv < lawl | cat > /dev/null
642MiB 0:00:00 [1.01GiB/s] [================================>] 100%
This code:
#include <stdio.h>

int main() {
    int ch;
    while ((ch = getchar()) != EOF) {
        if (ch == '\00') {
            putchar('@');
        } else {
            putchar(ch);
        }
    }
}
compiled with clang -O3 is somewhat faster:
$ pv < lawl | ./stupidtr > /dev/null
^C52MiB 0:00:06 [ 8.5MiB/s] [=> ] 8% ETA 0:01:0
Compiling with gcc -O4 -mtune=native -march=native (4.8.4) is comparable, maybe very slightly faster. Adding -march=native to clang (Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)) produces an identical binary.
This is presumably just because the generic processing code for replacements in tr is replaced with constants, and the checks can be compiled down. The LLVM IR (clang -S -emit-llvm -O3 stupidtr.c) looks pretty good.
I guess gzip must be faster because it's using SIMD instructions or something. Is it possible to get this up to gzip speeds?
Some specifications, if they're relevant:
- The file is a CSV; the null byte can only occur in a certain field, but some of the other fields are variable-length, so you can't just seek around arbitrarily. Most lines have a null byte in that field. I suppose this means you could do a Boyer-Moore search for ,\00, if that'd help (see the sketch after this list). Once you've found a null byte, it's also guaranteed that there can't be another one for a hundred bytes or so.
- A typical file is about 20 GiB uncompressed, but the files are bz2-compressed on disk, if that's relevant.
- You can parallelize if you want, though gzip manages this with a single core, so it shouldn't be necessary. I'll be running this either on a quad-core i7 running OS X or a two-vCPU cloud server running Linux.
- Both machines I might run on have 16GB of RAM.
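To make the search-and-skip idea concrete, here's a rough sketch of the direction I have in mind (untested on the real data; the block size and the use of stdio are arbitrary choices): read large blocks and let memchr, which is typically SIMD-optimized in libc, find each null, so almost no time is spent on the bytes in between.

#include <stdio.h>
#include <string.h>

#define BLOCK (1 << 20)        /* 1 MiB blocks; an arbitrary size */
#define REPLACE_CHAR '@'

int main(void) {
    static char buf[BLOCK];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
        char *p = buf, *end = buf + n;
        /* memchr scans many bytes per call; after each hit the scan simply
         * resumes past the replaced byte, which is effectively the "skip
         * ahead", since nulls are at least ~100 bytes apart. */
        while ((p = memchr(p, '\0', (size_t)(end - p))) != NULL) {
            *p++ = REPLACE_CHAR;
        }
        if (fwrite(buf, 1, n, stdout) != n) {
            perror("fwrite");
            return 1;
        }
    }
    return ferror(stdin) ? 1 : 0;
}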
Combining ideas from the various answers with some extra bithacks, here is an optimized version:
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFFER_SIZE 16384
#define REPLACE_CHAR '@'

int main(void) {
    /* define buffer as uint64_t to force alignment */
    /* make it one slot longer to allow for loop guard */
    uint64_t buffer[BUFFER_SIZE/8 + 1];
    ssize_t size, chunk;
    uint64_t *p, *p_end;
    uint64_t rep8 = (uint8_t)REPLACE_CHAR * 0x0101010101010101ULL;

    while ((size = read(0, buffer, BUFFER_SIZE)) != 0) {
        if (size < 0) {
            if (errno == EINTR) continue;
            fprintf(stderr, "read error: %s\n", strerror(errno));
            return 1;
        }
        p = buffer;
        p_end = p + ((size + 7) >> 3);
        *p_end = 0ULL; /* force a 0 at the end so the scan below terminates */
        for (;; p++) {
#define LOWBITS 0x0101010101010101ULL
#define HIGHBITS 0x8080808080808080ULL
            /* m gets 0x80 in every byte of *p that is 0x00 (a 0x01 byte next
             * to a zero byte in the same word can also be flagged, which is
             * harmless here: the nulls sit between commas) */
            uint64_t m = ((*p - LOWBITS) & ~*p & HIGHBITS);
            if (m != 0) {
                if (p >= p_end) break;
                m |= m >> 1; /* spread each marker bit to fill its byte */
                m |= m >> 2;
                m |= m >> 4;
                *p |= m & rep8; /* zero bytes become REPLACE_CHAR */
            }
        }
        for (unsigned char *pc = (unsigned char *)buffer;
             (chunk = write(1, pc, (size_t)size)) != size;
             pc += chunk, size -= chunk) {
            if (chunk < 0) {
                if (errno == EINTR) { chunk = 0; continue; } /* retry without advancing */
                fprintf(stderr, "write error: %s\n", strerror(errno));
                return 2;
            }
        }
    }
    return 0;
}
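For anyone puzzled by the bit twiddling: the mask computation is the classic zero-byte-detection trick from Bit Twiddling Hacks, and the final OR works because the byte being patched is known to be zero. Here is a tiny standalone check of the same logic on a single word (the sample bytes are made up):

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LOWBITS  0x0101010101010101ULL
#define HIGHBITS 0x8080808080808080ULL

/* Return w with every 0x00 byte replaced by rep; OR-ing the replacement in
 * is enough because the target byte is zero. */
static uint64_t replace_zero_bytes(uint64_t w, unsigned char rep) {
    uint64_t m = (w - LOWBITS) & ~w & HIGHBITS; /* 0x80 marks each zero byte */
    m |= m >> 1;                                /* spread the mark...        */
    m |= m >> 2;
    m |= m >> 4;                                /* ...until the byte is 0xFF */
    return w | (m & (rep * LOWBITS));
}

int main(void) {
    unsigned char bytes[8] = { 'a', 0, 'b', 0, 'c', 'd', 0, 'e' };
    uint64_t w;
    memcpy(&w, bytes, sizeof w);
    w = replace_zero_bytes(w, '@');
    memcpy(bytes, &w, sizeof w);
    assert(memcmp(bytes, "a@b@cd@e", 8) == 0);
    return 0;
}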