I have about thirty 500 MB files, one word per line. I have a script that does roughly this, in bash:
for i in *; do                               # assumes only the word files match *
    : > everythingButI                       # start with an empty file (no blank line)
    for j in *; do
        [ "$j" = "$i" ] && continue          # every file except $i
        cat "$j" >> everythingButI
        sort everythingButI | uniq > tmp
        mv tmp everythingButI
    done
    # lines that appear only in $i: suppress comm's columns 2 and 3
    comm -23 "$i" everythingButI > uniqueInI
    percentUnique=$(echo "scale=2; $(wc -l < uniqueInI) * 100 / $(wc -l < "$i")" | bc)
    echo "$i is $percentUnique% unique"
done
It computes the 'uniqueness' of each file (the files are already sorted and unique within each file).
So if I had files:
file1   file2   file3
a       b       1
c       c       c
d       e       e
f       g
        h
file1 would be 75% unique (because 1/4 of its lines are found in another file), file2 would be 60% unique, and file3 would be 33.33% unique. But make it 30 files at 500 MB a pop, and it takes quite a while to run.
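Just to pin down the arithmetic, here is a toy-sized Python check of that definition using in-memory sets (this is only to illustrate the numbers above; it obviously would not scale to 30 x 500 MB files in 2 GB of RAM):

    # Toy-sized check of the "uniqueness" definition, using in-memory sets.
    files = {
        "file1": {"a", "c", "d", "f"},
        "file2": {"b", "c", "e", "g", "h"},
        "file3": {"1", "c", "e"},
    }

    for name, words in files.items():
        # Union of every other file's words.
        others = set().union(*(w for n, w in files.items() if n != name))
        unique = words - others
        print(f"{name} is {100 * len(unique) / len(words):.2f}% unique")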
I'd like to write a Python script that does this much, much faster, but I'm wondering what the fastest algorithm for this would actually be. (I also only have 2 GB of RAM on this PC.)
Anyone have opinions about algorithms, or know of a faster way to do this?
EDIT: Since each of the inputs is already internally sorted and deduplicated, you actually need an n-way merge for this, and the hash-building exercise in the previous version of this post is rather pointless.
The n-way merge is kind of intricate if you're not careful. Basically, it works something like this:
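Here is a rough Python sketch of that merge (untested against real data; it assumes every file is non-empty, sorted, deduplicated, and newline-terminated):

    import sys

    def uniqueness(paths):
        # One pass: keep the current "head" line of every file, repeatedly take
        # the smallest head, see how many files share it, then advance those files.
        files = [open(p) for p in paths]
        current = [f.readline() for f in files]   # '' means that file is exhausted
        total = [0] * len(paths)                  # distinct lines seen in each file
        unique = [0] * len(paths)                 # lines seen in that file only

        while any(current):
            smallest = min(line for line in current if line)
            holders = [i for i, line in enumerate(current) if line == smallest]
            for i in holders:
                total[i] += 1
                if len(holders) == 1:
                    unique[i] += 1
                current[i] = files[i].readline()

        for f in files:
            f.close()
        for path, u, t in zip(paths, unique, total):
            print(f"{path} is {100.0 * u / t:.2f}% unique")

    if __name__ == "__main__":
        uniqueness(sys.argv[1:])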
I've left out the use of a priority queue that's in the full form of the merge algorithm; that only becomes significant if you have a large enough number of input files.
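For completeness, a heap-based variant might look like this (a sketch using Python's heapq; with only 30 files, the linear scan above is probably fine):

    import heapq

    def uniqueness_heap(paths):
        # Same merge, but a priority queue holds one (line, file index) entry per
        # open file, so finding the smallest head costs O(log n) rather than O(n).
        files = [open(p) for p in paths]
        total = [0] * len(paths)
        unique = [0] * len(paths)

        heap = []
        for i, f in enumerate(files):
            line = f.readline()
            if line:
                heapq.heappush(heap, (line, i))

        while heap:
            smallest, first = heapq.heappop(heap)
            holders = [first]
            while heap and heap[0][0] == smallest:   # pop every file sharing this line
                holders.append(heapq.heappop(heap)[1])
            for i in holders:
                total[i] += 1
                if len(holders) == 1:
                    unique[i] += 1
                line = files[i].readline()
                if line:
                    heapq.heappush(heap, (line, i))

        for f in files:
            f.close()
        return [100.0 * u / t for u, t in zip(unique, total)]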