I have two text files. One contains three times as many lines as the other. The smaller one contains headers, which I would like to interleave with the lines of the larger file so that each header is followed by three lines.
e.g.
small file:
header1
header2
header3
big file:
lines1.1
lines1.2
lines1.3
lines2.1
lines2.2
lines2.3
lines3.1
lines3.2
lines3.3
becomes:
header1
lines1.1
lines1.2
lines1.3
header2
lines2.1
lines2.2
lines2.3
header3
lines3.1
lines3.2
lines3.3
I have a shell solution to my problem:
new_reads_no="$(wc -l small_file.txt | awk '{print $1}')"
sequence="$(seq 1 $new_reads_no)"
for i in $sequence
do
    start=$((3*($i-1)+1))
    end=$(($start+2))
    awk -v c1=$i 'FNR==c1' small_file.txt >> Output.txt
    awk -v s="$start" -v e="$end" 'NR>=s&&NR<=e' big_file.txt >> Output.txt
done
Which works great. However, my small file is 10 million lines. At the rate it's going at the moment, I've estimated it will be finished in about 1 year.
Any help speeding this up would be much appreciated. Either a simple shell loop-free one liner or even just a quick tool in another language would be awesome.
Good old paste:
paste -d '\n' fsmall - - - <fbig
SYNOPSIS
paste [-s] [-d list] file ... file
OPERANDS
file: A pathname of an input file. If '-' is specified for one or more of the files, the standard input shall be used; the standard input shall be read one line at a time, circularly, for each instance of '-'.
Source: POSIX paste
This means that each hyphen reads one line from stdin, which is redirected from fbig in this case. Three hyphens means three lines from fbig for every line of fsmall.
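The "circular" reading can be mimicked in Python by handing the same iterator to zip several times. This is only a sketch of the idea, not how paste is implemented; the name paste_like and the in-memory sample data are made up for illustration. Unlike paste, zip stops at the shortest input instead of padding with empty fields.

```python
import io

def paste_like(small, big, hyphens=3):
    """Yield lines roughly as `paste -d '\\n' small - - -` would, with `big` as stdin.

    Hypothetical helper for illustration only.
    """
    shared = iter(big)  # one shared iterator plays the role of stdin
    # Passing the *same* iterator object several times means every
    # "column" pulls the next unread line, just like repeated hyphens.
    for row in zip(small, *[shared] * hyphens):
        yield from row

# Tiny in-memory stand-ins for fsmall and fbig.
small = io.StringIO("header1\nheader2\n")
big = io.StringIO("lines1.1\nlines1.2\nlines1.3\nlines2.1\nlines2.2\nlines2.3\n")

result = "".join(paste_like(small, big))
print(result, end="")
```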
Good old awk without buffering:
awk -v r=3 '1;{for(i=1;i<=r;++i) {getline < "-"; print}}' fsmall <fbig
This method mimics the idea of the paste solution. It uses getline to read the big file from stdin, so neither file has to be buffered. It is not very flexible, and one should always be careful when using getline [see All about getline].
Good old awk with buffering:
awk -v r=3 '(NR==FNR){b[FNR]=$0;next}(FNR%r==1){print b[++c]}1' fsmall fbig
This buffers the whole small file in memory, which could lead to performance issues when the small file is really big. (See Tripleee's comment.)
With GNU sed:
sed -e 'R f2' -e 'R f2' -e 'R f2' f1
where f1 is the smaller file and f2 is the larger one. The R command reads one line at a time from the given file. The lines thus obtained get appended after the current line read from f1.
Repeatedly reopening each input file and seeking to the spot where you last stopped reading is horribly inefficient. Making matters worse, you are reading the entire input file through to the end each time, and just picking out one line or three along the way. You could at least exit as soon as you have printed the stuff you wanted. But hang on.
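To put rough numbers on that, here is a back-of-the-envelope count of lines read. It assumes the 10 million headers mentioned in the question and that each of the two awk calls per iteration reads its file to the end; the variable names are just for illustration.

```python
# Rough cost of the original loop, using the figures from the question.
headers = 10_000_000          # lines in the small file
big_lines = 3 * headers       # lines in the big file

# Each iteration runs awk once over each file, reading both to the end.
lines_per_iteration = headers + big_lines
total_lines_scanned = headers * lines_per_iteration

# A single streaming pass reads every line exactly once.
single_pass_lines = headers + big_lines

print(f"loop:        {total_lines_scanned:.0e} lines read")
print(f"single pass: {single_pass_lines:.0e} lines read")
print(f"ratio:       {total_lines_scanned // single_pass_lines:,}x")
```

The ratio is simply the number of headers, which is why the loop gets so much worse as the files grow.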
Here is a simple Python script which does what you are asking for by simply keeping both files open and reading from each as you go.
with open('small_file.txt') as small, open('big_file.txt') as large:
    for line in small:
        print(line, end='')
        for x in range(3):
            print(large.readline(), end='')
If you would like to parametrize the file names, try
import sys
with open(sys.argv[1]) as small, open(sys.argv[2]) as large:
    ...
Output is to standard output, so if you saved the above into path/to/script.py you can simply run this at the shell prompt:
python3 path/to/script.py small_file.txt big_file.txt >Output.txt
The use of end='' is a minor hack to avoid having to pluck off the newline and have print add it back.
As an afterthought, you can do much the same thing in a shell script:
while IFS= read -r line; do
    printf '%s\n' "$line"
    for x in 1 2 3; do
        IFS= read -u 3 -r other
        printf '%s\n' "$other"
    done
done <small_file.txt 3<big_file.txt >Output.txt
but the shell's while read -r loop is inherently much slower.
If your small file is small enough to fit in memory:
$ awk 'NR==FNR{hdrs[NR]=$0; next} FNR%3 == 1{print hdrs[++c]} 1' small big
header1
lines1.1
lines1.2
lines1.3
header2
lines2.1
lines2.2
lines2.3
header3
lines3.1
lines3.2
lines3.3
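A caveat on the modulus test when two files are passed to awk: NR keeps counting across files while FNR restarts at 1 for each file, so a test on NR only lines up with the big file when the header count happens to be a multiple of 3. A quick check of the arithmetic, using a hypothetical helper and the header counts from the example and from the question:

```python
def first_big_line_matches(header_count, ratio=3):
    """Return (NR-based test, FNR-based test) for the first line of the big file.

    Hypothetical helper: it only reproduces awk's counter arithmetic.
    """
    nr = header_count + 1   # NR continues counting after the small file
    fnr = 1                 # FNR restarts for the second file
    return nr % ratio == 1, fnr % ratio == 1

print(first_big_line_matches(3))           # the 3-header example above
print(first_big_line_matches(10_000_000))  # the header count from the question
```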
otherwise:
$ awk '(NR%3 == 1) && ((getline hdr < "small") > 0){print hdr} 1' big
header1
lines1.1
lines1.2
lines1.3
header2
lines2.1
lines2.2
lines2.3
header3
lines3.1
lines3.2
lines3.3
See http://awk.freeshell.org/AllAboutGetline for why I'm using the syntax I'm using to call getline and why it's best avoided if not necessary.