
Interleave text files with a given ratio of lines from file1 to file2

Tags: text, bash, shell, awk

I have two text files. One contains three times as many lines as the other. The smaller one contains headers, which I would like to interleave with the lines of the larger file in a 3:1 ratio.

e.g.

small file:

header1
header2
header3

big file:

lines1.1
lines1.2
lines1.3
lines2.1
lines2.2
lines2.3
lines3.1
lines3.2
lines3.3

becomes:

header1
lines1.1
lines1.2
lines1.3
header2
lines2.1
lines2.2
lines2.3
header3
lines3.1
lines3.2
lines3.3

I have a shell solution to my problem:

new_reads_no="$(wc -l small_file.txt | awk '{print $1}')"   # number of headers in the small file
sequence="$(seq 1 $new_reads_no)"

for i in $sequence
do
    start=$((3*($i-1)+1))   # first big-file line for header i
    end=$(($start+2))       # last big-file line for header i
    awk -v c1=$i 'FNR==c1' small_file.txt >> Output.txt                         # print the i-th header
    awk -v s="$start" -v e="$end" 'NR>=s&&NR<=e' big_file.txt >> Output.txt     # print lines start..end of the big file
done

This works great. However, my small file is 10 million lines, and at the rate it's currently going I estimate it will finish in about a year.

Any help speeding this up would be much appreciated. Either a simple loop-free shell one-liner or even just a quick tool in another language would be awesome.

Asked Feb 07 '20 by Nicholas Bailey


4 Answers

Good old paste:

paste -d '\n' fsmall - - - <fbig

SYNOPSIS
    paste [-s] [-d list] file ... file

OPERANDS
    file: A pathname of an input file. If - is specified for one or more of the files, the standard input shall be used; the standard input shall be read one line at a time, circularly, for each instance of -.

source: POSIX paste

This means that each hyphen (-) reads one line from stdin, which in this case is fbig. Three hyphens mean three lines.
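
With the file names from the question (small_file.txt and big_file.txt), this would become something like the following, with the result redirected to Output.txt:

# each '-' operand pulls one big-file line per small-file line, so three of them give the 3:1 ratio
paste -d '\n' small_file.txt - - - < big_file.txt > Output.txt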

Good old awk without buffering:

awk -v r=3 '1;{for(i=1;i<=r;++i) {getline < "-"; print}}' fsmall <fbig

This method mimics the idea of the paste solution. It uses getline to avoid buffering the small file in memory. This is not really flexible, and one should always be careful when using getline [see All about getline].
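
With the question's file names, note that the small file is the awk input file while the big file is fed on standard input; the invocation would be something like:

# small_file.txt is read by awk directly; big_file.txt arrives on stdin for getline < "-"
awk -v r=3 '1;{for(i=1;i<=r;++i) {getline < "-"; print}}' small_file.txt < big_file.txt > Output.txt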

Good old awk with buffering:

awk -v r=3 '(NR==FNR){b[FNR]=$0;next}(FNR%r==1){print b[++c]}1' fsmall fbig

This buffers the small file in memory, which could lead to performance issues when the small file is really large (see the comment by tripleee).

Answered Sep 21 '22 by kvantour


With GNU sed

sed -e 'R f2' -e 'R f2' -e 'R f2' f1

where f1 is the smaller file. The R command reads one line at a time from the given file; the lines thus obtained get appended after the current line read from f1.
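
For other ratios, the repeated -e 'R ...' expressions can be generated in a loop. A minimal sketch using the question's file names (the ratio n and the array name sed_args are illustrative, not part of the original answer):

# build one 'R big_file.txt' expression per big-file line wanted after each header
n=3
sed_args=()
for ((i = 0; i < n; i++)); do
    sed_args+=(-e 'R big_file.txt')
done
sed "${sed_args[@]}" small_file.txt > Output.txt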

Answered Sep 21 '22 by Sundeep


Repeatedly reopening each input file and seeking to the spot where you last stopped reading is horribly inefficient. Making matters worse, you are reading the entire input file through to the end each time, and just picking out one line or three along the way. You could at least exit as soon as you have printed the stuff you wanted. But hang on.

Here is a simple Python script which does what you are asking for by simply keeping both files open and reading from each as you go.

with open('small_file.txt') as small, open('big_file.txt') as large:
    for line in small:
        print(line, end='')
        for x in range(3):
            print(large.readline(), end='')

If you would like to parametrize the file names, try

import sys

with open(sys.argv[1]) as small, open(sys.argv[2]) as large:
    ...

Output is to standard output, so if you saved the above into path/to/script.py you can simply run this at the shell prompt:

python3 path/to/script.py small_file.txt big_file.txt >Output.txt

The use of end='' is a minor hack to avoid having to pluck off the newline and have print add it back.

As an afterthought, you can do much the same thing in a shell script:

while IFS= read -r line; do
    printf '%s\n' "$line"
    for x in 1 2 3; do
        IFS= read -u 3 -r other
        printf '%s\n' "$other"
    done
done <small_file.txt 3<big_file.txt >Output.txt

but the shell's while read -r loop is inherently much slower.

Answered Sep 20 '22 by tripleee


If your small file is small enough to fit in memory:

$ awk 'NR==FNR{hdrs[NR]=$0; next} NR%3 == 1{print hdrs[++c]} 1' small big
header1
lines1.1
lines1.2
lines1.3
header2
lines2.1
lines2.2
lines2.3
header3
lines3.1
lines3.2
lines3.3

otherwise:

$ awk '(NR%3 == 1) && ((getline hdr < "small") > 0){print hdr} 1' big
header1
lines1.1
lines1.2
lines1.3
header2
lines2.1
lines2.2
lines2.3
header3
lines3.1
lines3.2
lines3.3

See http://awk.freeshell.org/AllAboutGetline for why I'm using the syntax I'm using to call getline and why it's best avoided if not necessary.
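
Note that in the second variant the small file's name is a string literal inside the awk program, not a shell variable; with the question's file names it would look something like:

# "small_file.txt" here is an awk string constant naming the file that getline reads from
awk '(NR%3 == 1) && ((getline hdr < "small_file.txt") > 0){print hdr} 1' big_file.txt > Output.txt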

Answered Sep 20 '22 by Ed Morton