 

large-scale document co-occurrence analysis

I have about 1,000 files, each of which contains about 20,000 documents. I also have a list of about 1,000,000 words.

I want to calculate how many times each word co-occurs with every other word, so the result is a sparse matrix of size 1M x 1M.

To speed up computation, I'm working on each file separately by doing the following:

1- Each core in my machine processes a single file and outputs a file of the following format

WordId1 WordId2 Frequency 

2- After all files are processed, I merge the 1,000 output files into a single file.

This is my current approach, but it takes very long, and I assume there must be a much more efficient way of doing it, so your comments are welcome.
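In outline, the per-file pass looks roughly like this (readDocuments and wordIdsOf are stand-ins for my actual parsing code):

// Per-file pass: count co-occurring word-id pairs within each document,
// then dump "WordId1 WordId2 Frequency" lines at the end.
Map<Long, Integer> counts = new HashMap<>();
for (Document doc : readDocuments(file)) {
    int[] ids = wordIdsOf(doc);                          // distinct word ids in this document
    for (int i = 0; i < ids.length; i++) {
        for (int j = i + 1; j < ids.length; j++) {
            int a = Math.min(ids[i], ids[j]), b = Math.max(ids[i], ids[j]);
            counts.merge(((long) a << 32) | b, 1, Integer::sum);   // pack the pair into a long
        }
    }
}
// finally write counts out as "WordId1 WordId2 Frequency" lines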

asked Jan 13 '14 by DotNet


2 Answers

I have done statistics like this before; I split the job into two steps.

Step 1: multi-threaded counting: compute the partition id of each pair and append the pair directly to the corresponding partition file (partition_id = (md5 of pair) mod partition_count; the partitioning is the key point). I used a hash_map to accumulate the counts, and whenever its size grew past a threshold I flushed the map data to the partition file, which saves a lot of disk space. Putting the output files on different disks sped up the process a lot.
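Roughly, step 1 could look like this for one pair (w1, w2). The md5 of the original is replaced by a plain hash of the packed key here for brevity; PARTITION_COUNT, THRESHOLD, appendToFile, and counts (one HashMap<Long, Integer> per partition) are made-up names:

// Route each pair to one of PARTITION_COUNT spill files, flushing to keep memory bounded.
long key = ((long) Math.min(w1, w2) << 32) | Math.max(w1, w2);
int partitionId = (Long.hashCode(key) & 0x7fffffff) % PARTITION_COUNT;
Map<Long, Integer> part = counts.get(partitionId);
part.merge(key, 1, Integer::sum);
if (part.size() > THRESHOLD) {                       // flush once the in-memory map gets large
    appendToFile("partition-" + partitionId, part);
    part.clear();
}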

Step 2: multi-threaded merging: merge the counts output by step 1 using a map. This step is done in memory, so if you are short of memory, choose a larger partition_count.
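The merge for one partition might look roughly like this (spillFilesOf, readPairCounts, and writeFinalCounts are made-up helpers; the file format is whatever step 1 wrote):

// Sum up all spill files of one partition, then write the final totals.
Map<Long, Integer> merged = new HashMap<>();
for (File spill : spillFilesOf(partitionId)) {
    readPairCounts(spill).forEach((key, count) -> merged.merge(key, count, Integer::sum));
}
writeFinalCounts(partitionId, merged);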

Note: this is an easy job for MapReduce: step 1 is the map phase and step 2 is the reduce phase. The key part is the partitioning, which corresponds to the partition step before the reduce phase in Hadoop.

answered by michaeltang


I guess you can get reasonable performance with careful treatment of the details. The problematic part seems to be memory. With enough memory, you could avoid writing out and merging the intermediate files.

When processing a single document, you could convert it into a BitSet where each bit is set if the corresponding word is present.
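For instance (wordIdsOf is a stand-in for whatever document parsing you already have):

// Mark each word of the document in a BitSet of one million bits;
// each word then counts at most once per document.
BitSet present = new BitSet(1_000_000);
for (int wordId : wordIdsOf(document)) {
    present.set(wordId);
}
// present.stream() iterates the distinct word ids of this document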

Your relation is symmetric, so I hope you only store (a, b, count) with a < b.

You need something like Multiset<Pair<String, String>> for counting, but there are more memory-conserving structures. Your words are numbered, so each one can be represented by an int and a pair can be represented by a long. So maybe something like Trove's TLongIntHashMap would do. You need concurrency, so you could either use atomics for the entries or partition the map into N parts (via some hashing, with N being bigger than the number of cores) and synchronize. It should be easy enough to build something on top of AtomicIntegerArray.

You didn't say whether there's any chance of your result fitting into memory, but if it does, that could lead to a huge speedup.

The requested explanation

The strings are numbered from 0 to one million, which fits in an int. Two such numbers together fit in a long, which can be used as a key for the TLongIntHashMap. For each document you identify all relevant word pairs, get the corresponding longs, and increment the value in the TLongIntHashMap.

Here, only the increment needs to be done under a lock. As this locking would hinder concurrency, I proposed using multiple maps, each with its own lock. The increments could be grouped so that multiple operations can be done under a single lock acquisition.
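Put together, that first variant could look roughly like this (Trove's TLongIntHashMap is assumed; the stripe count and the way pairs are produced are made up):

// N striped TLongIntHashMaps, each guarded by its own lock (N larger than the core count).
int N = 64;
TLongIntHashMap[] maps = new TLongIntHashMap[N];
ReentrantLock[] locks = new ReentrantLock[N];
for (int i = 0; i < N; i++) {
    maps[i] = new TLongIntHashMap();
    locks[i] = new ReentrantLock();
}

// For each pair a < b found in a document:
long key = ((long) a << 32) | b;
int stripe = (Long.hashCode(key) & 0x7fffffff) % N;
locks[stripe].lock();
try {
    maps[stripe].adjustOrPutValue(key, 1, 1);    // add with 1 or increment the existing count
} finally {
    locks[stripe].unlock();
}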

A better solution may be to use one TIntIntHashMap per word. Imagine you put all words (represented as ints) found in a document into a Set. Then you can loop like this

for (int w1 : words) {
    getLock(w1).lock();
    try {
        TIntIntHashMap map = getMap(w1);
        for (int w2 : words) {
            // adjustOrPutValue creates the entry with 1 if w2 is not yet in the map
            if (isLess(w1, w2)) map.adjustOrPutValue(w2, 1, 1);
        }
    } finally {
        getLock(w1).unlock();
    }
}

Here, isLess is an arbitrary antisymmetric irreflexive relation used to avoid storing both (a, b) and (b, a). While simply w1 < w2 would do, it'd lead to rather imbalanced values (getMap(0) would probably be big and getMap(1000000) would be empty). Using ((w1 - w2) ^ ((w1 + w2) << 31)) < 0 should do.
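Written out:

// Antisymmetric and irreflexive, and spreads the pairs roughly evenly over the maps.
static boolean isLess(int w1, int w2) {
    return ((w1 - w2) ^ ((w1 + w2) << 31)) < 0;
}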

answered by maaartinus