Speeding up separation of large text file based on line content in Bash

Tags: linux, bash, io, awk

I have a very large text file (around 20 GB and 300 million lines), which contains three columns separated by tabs:

word1 word2 word3
word1 word2 word3
word1 word2 word3
word1 word2 word3

word1, word2, and word3 differ from line to line. word3 specifies the class of the line and is repeated across many lines (there are thousands of distinct values). The goal is to separate the file by the line class (word3): for every line, word1 and word2 should be appended to a file named after word3. For example, for the line:

a b c

the string "a b" should be appended to the file called c.

Now I know how this can be done with a while loop that reads the file line by line and appends to the proper file for each line:

while IFS='' read -r line || [[ -n "$line" ]]; do
    # Variables
    read -a line_array <<< ${line}
    word1=${line_array[0]}
    word2=${line_array[1]}
    word3=${line_array[2]}

    # Adding word1 and word2 to file word3
    echo "${word1} ${word2}" >> ${word3}  
done < "inputfile"

It works, but it is very slow (even though I have a fast workstation with an SSD). How can this be sped up? I have already tried carrying out this procedure in /dev/shm, and splitting the file into 10 pieces and running the above script in parallel on each piece, but it is still quite slow. Is there a way to speed this up further?
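
For reference, the split-and-parallel attempt looked roughly like the sketch below (the piece_ prefix and separate.sh are placeholder names; separate.sh would just contain the while-read loop above, reading the piece passed as its argument):

split -n l/10 inputfile piece_       # split into 10 pieces without breaking lines (GNU split)
for piece in piece_*; do
    ./separate.sh "$piece" &         # hypothetical wrapper around the while-read loop above
done
wait                                 # wait for all background jobs to finish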

asked Oct 20 '18 by Jadzia


2 Answers

Let's generate an example file:

$ seq -f "%.0f" 3000000 | awk -F $'\t' '{print $1 FS "Col_B" FS int(2000*rand())}' >file

That generates a 3-million-line file with 2,000 different values in column 3, similar to this:

$ head -n 3 file; echo "..."; tail -n 3 file
1   Col_B   1680
2   Col_B   788
3   Col_B   1566
...
2999998 Col_B   1562
2999999 Col_B   1803
3000000 Col_B   1252

With a simple awk you can generate the files you describe this way:

$ time awk -F $'\t' '{ print $1 " " $2 >> $3; close($3) }' file
real    3m31.011s
user    0m25.260s
sys     3m0.994s

So that awk will generate the 2,000 group files in about 3 minutes 31 seconds. Certainly faster than Bash, but it can be made faster still by presorting the file on the third column and writing each group file in one go.

You can use the Unix sort utility in a pipe and feed the output to a script that separates the sorted groups into different files. Use the -s (stable) option with sort so that the third field is the only thing that determines the order of the lines; lines with equal third fields keep their original relative order.

Since we can assume sort has grouped the lines by column 3, the script only needs to detect when that value changes:

$ time sort -s -k3 file | awk -F $'\t' 'fn != ($3 "") { close(fn); fn = $3 } { print $1 " " $2 > fn }'
real    0m4.727s
user    0m5.495s
sys     0m0.541s

Because of the efficiency gained by presorting, the same net process completes in 5 seconds.

If you are sure that the 'words' in column 3 are ASCII only (i.e., you do not need to deal with UTF-8), you can set LC_ALL=C for additional speed:

$ time LC_ALL=C sort -s -k3 file | awk -F $'\t' 'fn != ($3 "") { close(fn); fn = $3 } { print $1 " " $2 > fn }'
real    0m3.801s
user    0m3.796s
sys     0m0.479s

From comments:

1) Please add a line to explain why we need the bracketed expression in fn != ($3 ""):

The awk pattern fn != ($3 "") {action} is an effective shortcut for fn != $3 || fn == "" {action}: appending "" forces a string comparison, so the test is also true on the first line, when fn is still empty. Use whichever form you consider more readable.
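
For comparison, here is the same pipeline written with the longer pattern:

$ sort -s -k3 file | awk -F $'\t' 'fn != $3 || fn == "" { close(fn); fn = $3 } { print $1 " " $2 > fn }'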

2) Not sure if this also works if the file is larger than the available memory, so this might be a limiting factor:

I ran the first and the last awk with 300 million records and 20,000 output files. The last one with sort did the task in 12 minutes. The first took 10 hours...

It may be that the sort version actually scales better, since opening, appending to, and closing 20,000 files 300 million times takes a while. It is more efficient to batch similar data together and stream it out in one go.

3) I was thinking about sort earlier but then felt it might not be the fastest because we have to read the whole file twice with this approach:

That is the case for purely random data; if the actual data is somewhat ordered, there is a tradeoff in reading the file twice. The first awk would be significantly faster with less random data, but then it also takes time to determine whether the file is already sorted. If you know the file is mostly sorted, use the first approach; if it is likely somewhat disordered, use the last.
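
One inexpensive way to check is sort's -c option, which only verifies the order (a single sequential pass) and exits non-zero at the first out-of-order line; for example:

$ sort -c -s -k3 file && echo "already grouped by column 3" || echo "needs sorting"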

answered Jan 12 '23 by dawg


You can use awk:

awk -F $'\t' '{ print $1 " " $2 >> $3; close($3) }' file
answered Jan 12 '23 by oguz ismail