I am working on a community detection algorithm for analyzing social network data from Facebook. The first task, detecting all cliques in the graph, can be done efficiently in parallel, and leaves me with an output like this:
17118 17136 17392
17064 17093 17376
17118 17136 17356 17318 12345
17118 17136 17356 17283
17007 17059 17116
Each of these lines represents a unique clique (a collection of node ids), and I want to sort these lines in descending order by the number of ids per line. In the case of the example above, here's what the output should look like:
17118 17136 17356 17318 12345
17118 17136 17356 17283
17118 17136 17392
17064 17093 17376
17007 17059 17116
(Ties---i.e., lines with the same number of ids---can be sorted arbitrarily.)
What is the most efficient way of sorting these lines?
UPDATE 2: Best solution
Based on benchmarking the solutions proposed (see below), here is the best solution (taken from Vlad, who in turn adapted it from other solutions proposed here). It's quite clever and does not even use sort:
for FILE in infile.* ; do
    awk '{ print >sprintf("tmpfile.%05d.%s", NF, FILE) }' \
        FILE=`basename $FILE` $FILE &
done
wait
ls -1r tmpfile.* | xargs cat >outfile
rm -f tmpfile.*
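For the curious, here is a tiny self-contained demonstration of why the concatenation comes out in the right order (the two sample lines and the infile.1 name are just for illustration):
printf '17118 17136 17392\n17118 17136 17356 17318 12345\n' >infile.1
awk '{ print >sprintf("tmpfile.%05d.%s", NF, FILE) }' FILE=infile.1 infile.1
ls -1r tmpfile.*    # tmpfile.00005.infile.1 is listed before tmpfile.00003.infile.1
Because the clique size is zero-padded into the file name, ls -1r already yields the buckets in descending size order, so cat needs no sort at all.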
UPDATE 1: Benchmarking results of proposed solutions
For benchmarking I took the cliques found in an Oklahoma State Facebook network. The unsorted files containing these cliques look just like the first example above; together they contain 46,362,546 lines, for a total size of 6.4 GB. The cliques are spread almost evenly over 8 files. The system I am testing this on has 4 physical processors, each with 6 cores and a 12 MB L2 cache, for a total of 24 cores, and 128 GB of physical memory. Because the lines to be sorted were split into 8 files, most of these solutions used 8 (or 16) concurrent processes.
Ignoring the first naive approach, I benchmarked the last 5 suggestions of Vlad Romascanu (whose answer I have selected).
The first solution was not too efficient:
real 6m35.973s
user 26m49.810s
sys 2m14.080s
I tried solutions 2, 3, and 4, which use FIFO files, but they each used only one sort process and were thus taking a long time (so I killed them before they could finish).
The last solution was the quickest:
real 1m3.272s
user 1m21.540s
sys 1m22.550s
Note that the user time of this solution is 1m21s, much better than the first solution's 26 minutes.
A naive approach could be simply:
awk '{ print NF " " $0 }' infile | sort -k1,1nr |
    awk '{ $1=""; print $0 }' >outfile
This will keep up to 3 CPUs busy. sort is not limited by the amount of physical memory available: use the -S and -T switches to configure how much memory to use (-S) before resorting to temporary files in a temp directory (-T) on a big enough (and ideally fast) partition.
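For example, the naive pipeline with explicit memory and temp-directory settings might look like this (the 16G buffer size and the /mnt/scratch path are illustrative placeholders, not values from the question):
# Illustrative only: allow sort up to 16 GB of RAM (-S) and spill temporary
# files to a fast scratch partition (-T) if the data does not fit.
awk '{ print NF " " $0 }' infile | sort -S 16G -T /mnt/scratch -k1,1nr |
    awk '{ $1=""; print $0 }' >outfile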
If you can produce several input files by subdividing the work leading up to the sort phase, you would then be able to do:
for FILE in infile.* ; do
    awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.tmp &
done
wait
sort -k1,1nr -m infile.*.tmp | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.tmp
This will use up to N*2 CPUs; moreover, the last sort (merge-sort) is highly efficient.
Refining further, to improve parallelism to N*2+1 by using FIFOs instead of intermediate files, again assuming multiple input files are possible:
for FILE in infile.* ; do
    mkfifo $FILE.fifo
    awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.fifo &
done
sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo
If multiple input files are not possible, you can simulate them (adding I/O overhead which will hopefully be amortized by the number of processes available):
PARALLELISM=5 # I want 5 parallel instances
for N in `seq 0 $((PARALLELISM - 1))` ; do   # N must range over the possible values of NR % PARALLELISM
    mkfifo infile.$N.fifo
    awk 'NR % '$PARALLELISM' == '$N' { print NF " " $0 }' infile |
        sort -k1,1nr >infile.$N.fifo &
done
sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo
Because we use modulo-line-number we have good locality, and the filesystem cache should ideally bring the cost of reading the input file over and over in $PARALLELISM processes closer to zero.
Even better, reading the input file only once and round-robining input lines into several sort pipes:
PARALLELISM=5 # I want 5 parallel instances
for N in `seq 0 $((PARALLELISM - 1))` ; do   # again, N ranges over the values of NR % PARALLELISM
    mkfifo infile.$N.fifo1
    mkfifo infile.$N.fifo2
    sort -k1,1nr infile.$N.fifo1 >infile.$N.fifo2 &
done
awk '{ print NF " " $0 >("infile." NR % '$PARALLELISM' ".fifo1") }' infile &
sort -k1,1nr -m infile.*.fifo2 | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo[12]
You should measure performance for various values of $PARALLELISM and then pick the optimal one.
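One way to do that, sketched below, is to wrap the round-robin pipeline above in a script (here hypothetically named sort_cliques.sh, modified to take PARALLELISM from its environment instead of hard-coding it) and time it for a few fan-out values:
# Illustrative benchmark loop; the script name and the fan-out values are placeholders.
for P in 2 4 8 16 24 ; do
    echo "== PARALLELISM=$P =="
    time env PARALLELISM=$P ./sort_cliques.sh
done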
As shown in other posts, you can of course use cut instead of the final awk (i.e. the one that strips the first column) for potentially better efficiency. :)
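For instance, the merge stage of the multi-file variant might end like this (a sketch; it assumes the count and the rest of the line are separated by a single space, as the awk stages above produce):
# cut drops the first space-delimited field (the clique size) and keeps the rest of the line
sort -k1,1nr -m infile.*.tmp | cut -d' ' -f2- >outfile
Unlike awk's $1="", this also avoids leaving a leading blank on each output line.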
Updated all scripts for the filename convention you provided, and fixed a bug in the last version.
Also, using the new filename convention, and if I/O is not the bottleneck, a very slight variation on dave's/niry's solutions should probably be even more efficient:
for FILE in infile.* ; do
    awk '{ print >sprintf("tmpfile.%05d.%s", NF, FILE) }' \
        FILE=`basename $FILE` $FILE &
done
wait
ls -1r tmpfile.* | xargs cat >outfile
rm -f tmpfile.*
I wonder how quick this would be:
#!/bin/sh
# Reads the cliques on stdin and writes the sorted result to stdout,
# e.g.: ./script.sh <infile >outfile
rm -rf /tmp/fb
mkdir /tmp/fb
cd /tmp/fb
awk '{ print $0 > NF }'      # one bucket file per line width, named by the field count
ls | sort -nr | xargs cat    # concatenate buckets, widest first
Doesn't take advantage of a lot of cores, though.
Since you don't really need to sort, just copy into buckets, you could split into files by number of tokens; this will be the fastest:
# Creates one bucket file per clique size (000000003, 000000004, ...) in the current directory.
perl -ne '@f = split /\s+/; $t = @f; open($fh[$t], sprintf(">%09d", $t)) unless $fh[$t]; $f = $fh[$t]; print $f $_;' infile
cat `ls -1r 0*` >outfile
BTW, the disk will be the bottleneck; the number of cores and their usage would not really matter.