What is the fastest way to break a single gigabytes-long line into separate lines at delimiter patterns, using the bash shell?

For example, given a line like a11b12c22d322 e..., where the delimiters are runs of digits or spaces, we want to transform it into

a
b
c
d
e
...

sed needs to read the whole line into memory; for a line of gigabytes this is not efficient, and the job cannot be done at all without sufficient memory.

EDIT:

Could anyone explain how grep, tr, awk, perl, and python manage memory when reading a large file? What do they read into memory, and how much at a time?

asked by Osiris Xu

2 Answers

If you use gawk (which, I believe, is the default awk on Linux), you can set the RS variable to a regular expression, so that runs of digits or spaces are recognized as record separators instead of a newline.

awk '{print}' RS="[[:digit:] ]+" file.txt

(Using a single bracket expression for digits and spaces treats a mixed run like 322 followed by a space as one separator, avoiding an empty record.)
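
For example, on the sample input from the question:

$ printf 'a11b12c22d322 e' | awk '{print}' RS="[[:digit:] ]+"
a
b
c
d
e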

As to your second question, all of these programs read some fixed number of bytes into an internal buffer and search it for what they consider a line separator, to simulate the appearance of reading a single line at a time. To prevent them from buffering too much data while searching for the end of the line, you need to change the program's idea of what terminates a line.

Most languages allow you to do this, but only with a single character. gawk makes it easy by letting you specify a regular expression to recognize as the end of a record. This saves you from having to implement the fixed-size buffer and end-of-line search yourself.
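
To make that concrete, here is a minimal sketch (my illustration, not part of the answer) of such a fixed-size-buffer loop in C, using the question's separators of digits and spaces; memory use stays constant no matter how long the "line" is:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char buf[1 << 16];   /* 64 KiB chunk */
    size_t n;
    int sep = 0;         /* inside a run of separator characters? */

    while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
        for (size_t i = 0; i < n; i++) {
            unsigned char ch = (unsigned char) buf[i];
            if (isdigit(ch) || isspace(ch)) {
                if (!sep) {            /* collapse each run to one newline */
                    putchar('\n');
                    sep = 1;
                }
            } else {
                sep = 0;
                putchar(ch);
            }
        }
    }
    return ferror(stdin) ? 1 : 0;
}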

answered by chepner

Fastest... You can do it with the help of gcc. Here's a version which reads data from the given file name if one is supplied, otherwise from stdin. If this is still too slow, you can see if you can make it faster by replacing getchar() and putchar() (which may be macros and should optimize very well) with your own buffering code. If we want to get ridiculous and go even faster, you should use three threads, so the kernel can copy the next block of data on one core while another core does the processing and a third core copies the processed output back to the kernel.

#!/bin/bash

set -e

BINNAME=$(mktemp)
trap 'rm -f "$BINNAME"' EXIT  # clean up the temporary binary on exit

gcc -xc -O3 -o "$BINNAME" - <<"EOF"
#include <ctype.h>   /* for isdigit() and isspace() */
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    int sep = 0;    /* nonzero while inside a run of separator characters */

    /* speed is a requirement, so let's reduce io overhead */
    const int bufsize = 1024*1024;
    setvbuf(stdin, malloc(bufsize), _IOFBF, bufsize);
    setvbuf(stdout, malloc(bufsize), _IOFBF, bufsize);
    /* above buffers intentionally not freed, it doesn't really matter here */

    int ch;
    while((ch = getc(stdin)) >= 0) {
        if (isdigit(ch) || isspace(ch)) {
            if (!sep) {
                if (putc('\n', stdout) == EOF) break;
                sep = 1;
            }
        } else {
            sep = 0;
            if (putc(ch, stdout) == EOF) break;
        }
    }

    /* flush should happen by on-exit handler, as buffer is not freed,
       but this will detect write errors, for program exit code */
    fflush(stdout); 

    return ferror(stdin) || ferror(stdout);
}
EOF

if [ -z "$1" ] ; then
  $BINNAME <&0
else
  $BINNAME <"$1"
fi
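
Assuming the script above is saved as split.sh (the file name is just for illustration), usage looks like this:

$ printf 'a11b12c22d322 e' | bash split.sh
a
b
c
d
e
$ bash split.sh bigfile.txt > out.txt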

Edit: I happened to look at GNU/Linux stdio.h; some notes: putchar/getchar are not macros, but putc/getc are, so using the latter might be a slight optimization, probably avoiding one function call. I changed the code to reflect this, and also added checking of putc's return code while at it.

answered by hyde