For example, given a line a11b12c22d322 e...
where the field separators are runs of digits or spaces, we want to transform it into
a
b
c
d
e
...
sed needs to read the whole line into memory; for a line of gigabytes this is not efficient, and the job cannot be done at all without sufficient memory.
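By contrast, tr processes its input as a byte stream and never needs the whole line in memory. As a sketch (using the sample line from the question), the same transformation can be done with tr alone:

```shell
# Translate every digit or space to a newline; -s squeezes repeats,
# so each run of separators produces a single line break.
printf 'a11b12c22d322 e' | tr -s '0-9 ' '\n'
```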
EDIT:
Could anyone please explain how grep, tr, Awk, perl, and python manage memory when reading a large file? What, and how much, content do they read into memory at a time?
If you use gawk (which is the default awk on Linux, I believe), you can use the RS variable to specify that multi-digit numbers or runs of spaces are recognized as record separators instead of a newline:
awk '{print}' RS="[[:digit:]]+| +" file.txt
As to your second question, all of these programs need to read some fixed number of bytes into an internal buffer and search it for their idea of a line separator, to simulate reading a single line at a time. To prevent them from reading too much data while searching for the end of the line, you need to change the program's idea of what terminates a line.
Most languages allow you to do this, but typically only let you specify a single character. gawk makes it easy by allowing you to specify a regular expression that recognizes the end of a record. This saves you from having to implement the fixed-size buffer and end-of-line search yourself.
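To illustrate the single-character limitation: even POSIX awk accepts a one-character RS, which works when the separator is exactly one space, but it cannot express "one or more digits" the way gawk's regex RS can:

```shell
# A single-character RS is portable across awk implementations,
# but only matches that exact character, not runs of separators.
printf 'a b c' | awk '{print}' RS=' '
```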
Fastest? You can do it with the help of gcc; here's a version which reads data from the given file name if given, otherwise from stdin. If this is still too slow, you can see whether you can make it faster by replacing getchar() and putchar() (which may be macros and should optimize very well) with your own buffering code. If we want to get ridiculous, then for even more speed you could use three threads, so the kernel can copy the next block of data with one core while another core does the processing and a third core copies the processed output back to the kernel.
#!/bin/bash
set -e
BINNAME=$(mktemp)
trap 'rm -f "$BINNAME"' EXIT
gcc -xc -O3 -o "$BINNAME" - <<"EOF"
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int sep = 0;
    /* speed is a requirement, so let's reduce io overhead */
    const int bufsize = 1024*1024;
    setvbuf(stdin, malloc(bufsize), _IOFBF, bufsize);
    setvbuf(stdout, malloc(bufsize), _IOFBF, bufsize);
    /* above buffers intentionally not freed, it doesn't really matter here */

    int ch;
    while ((ch = getc(stdin)) != EOF) {
        if (isdigit(ch) || isspace(ch)) {
            if (!sep) {
                if (putc('\n', stdout) == EOF) break;
                sep = 1;
            }
        } else {
            sep = 0;
            if (putc(ch, stdout) == EOF) break;
        }
    }
    /* flush would happen in the on-exit handler anyway, as the buffer is
       not freed, but doing it here lets us detect write errors for the
       program exit code */
    fflush(stdout);
    return ferror(stdin) || ferror(stdout);
}
EOF
EOF
if [ -z "$1" ]; then
    "$BINNAME" <&0
else
    "$BINNAME" <"$1"
fi
Edit: I happened to look at GNU/Linux stdio.h, and some notes: putchar/getchar are not macros, but putc/getc are, so using those instead might be a slight optimization, probably avoiding one function call; I changed the code to reflect this. Also added checking the return code of putc, while at it.