I have about 1,000 files, each of which contains about 20,000 documents. I also have a list of about 1,000,000 words.
I want to calculate how many times each word occurs with any other word, so the result is a sparse matrix of size 1M x 1M.
To speed up the computation, I'm working on each file separately by doing the following (sketched below):
1- Each core in my machine processes a single file and outputs a file in the following format:
WordId1 WordId2 Frequency
2- After all files are done, I merge the 1,000 output files into a single file.
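Roughly, the per-file step looks like this (simplified; parsing and the word-to-ID lookup are omitted):

import java.io.PrintWriter;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Count each distinct word-ID pair once per document, then dump the
// per-file totals as "WordId1 WordId2 Frequency" lines.
static void countFile(List<int[]> documents, String outPath) throws Exception {
    Map<Long, Integer> counts = new HashMap<>();
    for (int[] doc : documents) {
        int[] ids = Arrays.stream(doc).distinct().sorted().toArray();
        for (int i = 0; i < ids.length; i++) {
            for (int j = i + 1; j < ids.length; j++) {
                long key = ((long) ids[i] << 32) | ids[j]; // smaller id in the high bits
                counts.merge(key, 1, Integer::sum);
            }
        }
    }
    try (PrintWriter out = new PrintWriter(outPath)) {
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            long k = e.getKey();
            out.println((k >>> 32) + " " + (int) k + " " + e.getValue());
        }
    }
}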
This is my current approach, but it takes very long, and I assume there must be a much more efficient way of doing it, so your comments are welcome.
I have done a statistics job like this before. I split the job into two steps:
step 1: multi-threaded counting: compute the partition id of each pair and write it directly to the corresponding partition file (partition_id = (md5 of pair) mod partition_count; the partitioning is the key point). I used a hash map to accumulate the counts (sketched below); when its size grew larger than a threshold, I flushed the map data to the partition files, which saves a lot of disk space. I also put the output files on different disks, which sped up the process a lot.
step 2: multi-threaded merging: merge the counts output by step 1 using a map. This is done in memory; if you are short of memory, choose a larger partition_count.
notes: this is an easy job for MapReduce. Step 1 is the map phase and step 2 is the reduce phase; the key part is the partitioning, which corresponds to the partition step before the reduce phase in Hadoop.
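A rough sketch of the step-1 spill-to-partition logic in Java (the names and constants are illustrative, and any stable hash of the packed pair works in place of MD5):

import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

// Step 1: accumulate pair counts in memory and spill them to each pair's
// partition file whenever the map grows past a threshold; every partition
// file can then be merged independently, and in memory, in step 2.
static final int PARTITION_COUNT = 256;         // choose larger if memory is tight
static final int FLUSH_THRESHOLD = 5_000_000;   // entries held before spilling

static int partitionOf(long pairKey) {
    return (Long.hashCode(pairKey) & 0x7fffffff) % PARTITION_COUNT;
}

static void countPair(long pairKey, Map<Long, Integer> counts, PrintWriter[] partitions) {
    counts.merge(pairKey, 1, Integer::sum);
    if (counts.size() >= FLUSH_THRESHOLD) {
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            long k = e.getKey();
            partitions[partitionOf(k)].println((k >>> 32) + " " + (int) k + " " + e.getValue());
        }
        counts.clear();
    }
}

Step 2 then reads one partition file at a time, sums the counts per pair in a map, and appends the totals to the final output.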
I guess you can get reasonable performance with a careful treatment of the details. The problematic part seems to be memory: with enough memory, you could avoid the writing out and merging.
When processing a single document, you could convert it into a BitSet where each bit is set if the corresponding word is present. Your relation is symmetric, so I hope you only store (a, b, count) with a < b.
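For example (a minimal sketch; documentWordIds and count(a, b) are placeholders for the document's word IDs and for whatever pair-counting structure you end up with):

import java.util.BitSet;

// Mark which words occur in this document; iterating the set bits then
// yields every distinct pair exactly once, with a < b.
BitSet present = new BitSet(1_000_000);
for (int wordId : documentWordIds) {
    present.set(wordId);
}
for (int a = present.nextSetBit(0); a >= 0; a = present.nextSetBit(a + 1)) {
    for (int b = present.nextSetBit(a + 1); b >= 0; b = present.nextSetBit(b + 1)) {
        count(a, b);
    }
}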
You need something like a Multiset<Pair<String, String>> for counting, but there are more memory-conserving structures. Your words are numbered, so each one can be represented with an int, and a pair can be represented with a long. So maybe something like a LongIntHashMap would do. You need concurrency, so you could either use atomics for the entries or partition the map into N parts (via some hashing, with N being bigger than the number of cores) and synchronize. It should be easy enough to build something on top of AtomicIntegerArray.
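A sketch of the partitioned variant, with each pair packed into a long and the counts striped over N locked maps (plain HashMaps here for brevity; a primitive long-to-int map like the one mentioned above would be more compact):

import java.util.HashMap;
import java.util.Map;

// N count maps, each guarded by its own lock. N should be larger than the
// number of cores so that threads rarely contend for the same stripe.
static final int N = 64;
static final Map<Long, Integer>[] stripes = new Map[N];
static final Object[] locks = new Object[N];
static {
    for (int i = 0; i < N; i++) {
        stripes[i] = new HashMap<>();
        locks[i] = new Object();
    }
}

static void count(int a, int b) {            // requires a < b
    long key = ((long) a << 32) | b;         // two ints packed into one long
    int stripe = (Long.hashCode(key) & 0x7fffffff) % N;
    synchronized (locks[stripe]) {
        stripes[stripe].merge(key, 1, Integer::sum);
    }
}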
You didn't say whether there's any chance of your result fitting into memory, but if it does, that could yield a huge speedup.
The strings are numbered from 0 to one million, which fits in an int. Two such numbers together fit in a long, which can be used as a key for the TLongIntHashMap. For each document you identify all relevant String pairs, get the corresponding longs, and increment the value in the TLongIntHashMap.
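With GNU Trove (Trove 3) that could look roughly like this; adjustOrPutValue increments an existing count or inserts 1 for a new key:

import gnu.trove.map.hash.TLongIntHashMap;

// Shared map from packed pair -> count. For each document, walk the distinct
// word IDs and bump the counter of every (smaller, larger) pair.
TLongIntHashMap counts = new TLongIntHashMap();

void countDocument(int[] sortedDistinctWordIds) {
    for (int i = 0; i < sortedDistinctWordIds.length; i++) {
        for (int j = i + 1; j < sortedDistinctWordIds.length; j++) {
            long key = ((long) sortedDistinctWordIds[i] << 32) | sortedDistinctWordIds[j];
            counts.adjustOrPutValue(key, 1, 1); // +1, or insert with value 1
        }
    }
}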
Here, only the increment needs to be done under a lock. As this locking would hinder concurrency, I proposed using multiple maps, each with its own lock. The increments could also be grouped, so that multiple operations can be done under a single lock acquisition.
A better solution may be to use one TIntIntHashMap per word. Imagine you put all words (represented as ints) found in a document into a Set. Then you can loop like this:
for (int w1 : words) {
    getLock(w1).lock();
    TIntIntHashMap map = getMap(w1);
    for (int w2 : words) {
        // count the unordered pair only once, in the map of the "smaller" word
        if (isLess(w1, w2)) map.adjustOrPutValue(w2, 1, 1);
    }
    getLock(w1).unlock();
}
Here, isLess is an arbitrary antisymmetric irreflexive relation used to avoid storing both (a, b) and (b, a). While simply w1 < w2 would do, it would lead to rather imbalanced map sizes (getMap(0) would probably be big and getMap(1000000) would be empty). Using ((w1 - w2) ^ ((w1 + w2) << 31)) < 0 should do: it behaves like w1 < w2 when w1 + w2 is even and like w1 > w2 when it is odd, so the pairs spread roughly evenly across the maps.
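Put together, the helpers used above could look like this (a sketch; one map and one lock per word, sized for the 1M-word vocabulary):

import gnu.trove.map.hash.TIntIntHashMap;
import java.util.concurrent.locks.ReentrantLock;

static final int WORDS = 1_000_000;
static final TIntIntHashMap[] maps = new TIntIntHashMap[WORDS];
static final ReentrantLock[] locks = new ReentrantLock[WORDS];
static {
    for (int i = 0; i < WORDS; i++) {
        maps[i] = new TIntIntHashMap();
        locks[i] = new ReentrantLock();
    }
}

static TIntIntHashMap getMap(int w) { return maps[w]; }
static ReentrantLock getLock(int w) { return locks[w]; }

// Antisymmetric, irreflexive relation: compares like w1 < w2 when w1 + w2 is
// even and like w1 > w2 when it is odd, spreading pairs evenly over the maps.
static boolean isLess(int w1, int w2) {
    return ((w1 - w2) ^ ((w1 + w2) << 31)) < 0;
}

If a million ReentrantLocks feels too heavy, the locks can be striped over a much smaller array (say, getLock(w % 1024)) without changing anything else.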