I have a very large text file (around 20 GB and 300 million lines), which contains three columns separated by tabs:
word1 word2 word3
word1 word2 word3
word1 word2 word3
word1 word2 word3
word1, word2, and word3 differ from line to line. word3 specifies the class of the line and repeats across many lines (there are thousands of distinct values). The goal is to separate the file by line class (word3), i.e. for every line, word1 and word2 should be appended to a file named word3. For example, for the line:
a b c
the string "a b" should be appended to the file called c.
Now I know how this can be done with a while loop that reads the file line by line and appends to the proper file for each line:
while IFS='' read -r line || [[ -n "$line" ]]; do
    # Split the line into its three fields
    read -a line_array <<< "${line}"
    word1=${line_array[0]}
    word2=${line_array[1]}
    word3=${line_array[2]}
    # Append word1 and word2 to the file named after word3
    echo "${word1} ${word2}" >> "${word3}"
done < "inputfile"
It works, but it is very slow (even though I have a fast workstation with an SSD). How can this be sped up? I have already tried carrying out this procedure in /dev/shm, and I have split the file into 10 pieces and run the above script in parallel on each piece. But it is still quite slow. Is there a way to speed this up further?
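For reference, the splitting attempt was along these lines (just a sketch; the chunk_ names are arbitrary and process_chunk.sh is a hypothetical wrapper around the while-read loop above):
# Split into 10 pieces without breaking lines (GNU split), then run the loop on each piece in parallel
split -n l/10 inputfile chunk_
for f in chunk_*; do
    bash process_chunk.sh "$f" &
done
wait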
Let's generate an example file:
$ seq -f "%.0f" 3000000 | awk -F $'\t' '{print $1 FS "Col_B" FS int(2000*rand())}' >file
That generates a 3 million line file with 2,000 different values in column 3 similar to this:
$ head -n 3 file; echo "..."; tail -n 3 file
1 Col_B 1680
2 Col_B 788
3 Col_B 1566
...
2999998 Col_B 1562
2999999 Col_B 1803
3000000 Col_B 1252
With a simple awk you can generate the files you describe this way:
$ time awk -F $'\t' '{ print $1 " " $2 >> $3; close($3) }' file
real 3m31.011s
user 0m25.260s
sys 3m0.994s
So that awk generates the 2,000 group files in about 3 minutes 31 seconds. That is certainly faster than Bash, but it can be made faster still by presorting the file by the third column and writing each group file in one go.
You can use the Unix sort utility in a pipe and feed its output to a script that separates the sorted groups into different files. Use the -s (stable) option with sort so that the value of the third field is the only thing that changes the order of the lines; lines sharing a third-field value keep their original relative order.
Since we can assume sort has partitioned the file into groups based on column 3, the script only needs to detect when that value changes:
$ time sort -s -k3 file | awk -F $'\t' 'fn != ($3 "") { close(fn); fn = $3 } { print $1 " " $2 > fn }'
real 0m4.727s
user 0m5.495s
sys 0m0.541s
Because of the efficiency gained by presorting, the same net process completes in about 5 seconds.
If you are sure that the 'words' in column 3 are ASCII only (i.e., you do not need to deal with UTF-8), you can set LC_ALL=C for additional speed:
$ time LC_ALL=C sort -s -k3 file | awk -F $'\t' 'fn != ($3 "") { close(fn); fn = $3 } { print $1 " " $2 > fn }'
real 0m3.801s
user 0m3.796s
sys 0m0.479s
From comments:
1) Please add a line to explain why we need the bracketed expression in fn != ($3 ""):
The awk construct fn != ($3 "") {action} is an effective shortcut for fn != $3 || fn == "" {action}; use whichever you consider more readable.
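As a small illustration on a toy input (not part of the timings above), the block runs on the very first record, when fn is still empty, and afterwards only when column 3 changes:
$ printf 'a\tb\tX\nc\td\tX\ne\tf\tY\n' |
  awk -F $'\t' 'fn != ($3 "") { print "new group: " $3; fn = $3 }'
new group: X
new group: Y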
2) Not sure if this also works if the file is larger than the available memory, so this might be a limiting factor:
I ran the first and the last awk with 300 million records and 20,000 output files. The one with sort did the task in 12 minutes. The first took 10 hours... It may be that the sort version actually scales better, since opening, appending to, and closing the 20,000 output files 300 million times (once per record) takes a while. It is more efficient to batch similar data together and stream it.
3) I was thinking about sort earlier but then felt it might not be the fastest because we have to read the whole file twice with this approach:
This is the case for purely random data; if the actual data is somewhat ordered, there is a tradeoff with reading the file twice. The first awk would be significantly faster on less random data. But then it also takes time to determine whether the file is sorted. If you know the file is mostly sorted, use the first; if it is likely somewhat disordered, use the last.
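If you want a quick way to see how ordered the file already is, sort's check mode can do that without performing a full sort (a small sketch; it exits zero if the file is already ordered on the key, otherwise it reports the first out-of-order line):
$ LC_ALL=C sort -c -s -k3 file && echo "already grouped by column 3"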
You can use awk:
awk -F $'\t' '{ print $1 " " $2 >> $3; close($3) }' file