
Algorithm for counting common group memberships with big data

I need to write a program that counts the number of times two users are in the same group. The users are given by username and groups by id. For example, with the input (stored in a text-file):

john 32
john 21
jim 21
jim 32
bob 32

I want the result:

john-jim 2 
john-bob 1
jim-bob 1

This sounds trivial. But the problem is: I have 1.8 million groups and 300,000 users. And a lot of memberships (I'm expecting at least an average of 50 per user, possibly much more). This means a HUGE amount of data and processing.

I've written 5 different programs to do this, and none of them has been able to cope with the amount of data: it was too slow as a PostgreSQL query; it used too much memory when counting in a Java Map in working memory (first a heap-space error, then, after some optimization, the rarer "GC overhead limit exceeded"); it was too slow when writing the counts continuously to the database from Java (even when optimized with batch queries). Growing increasingly desperate, I've tried some more exotic things, like writing all the pairs to an array, sorting it (O(n log n)) and then counting them little by little. But it was still way too much data to hold in memory.
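
Roughly, the Map-based attempt looked like this (a simplified sketch, not my exact code; the input file name and the pair-key format are just for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryPairCounter {
    public static void main(String[] args) throws IOException {
        // group id -> usernames in that group
        Map<String, List<String>> groups = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("memberships.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length < 2) continue;
                groups.computeIfAbsent(parts[1], k -> new ArrayList<>()).add(parts[0]);
            }
        }

        // "smaller-larger" username pair -> number of shared groups.
        // This is the map that blows up: tens of millions of String keys or more.
        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> members : groups.values()) {
            Collections.sort(members);
            for (int i = 0; i < members.size(); i++) {
                for (int j = i + 1; j < members.size(); j++) {
                    pairCounts.merge(members.get(i) + "-" + members.get(j), 1, Integer::sum);
                }
            }
        }

        for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}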

Any ideas on an algorithm for doing this? Or is it impossible?

asked Apr 05 '13 by dottorep

2 Answers

An RDBMS is specialized in operations like sorting. Doing this outside the DB will hardly ever come close in performance. Do it with SQL!

This would do the job (simplified in update):

SELECT t1.usr || '-' || t2.usr, count(*) AS ct
FROM   usr_grp t1
JOIN   usr_grp t2 USING (grp_id) 
WHERE  t2.usr > t1.usr   -- prevent dupes and get sorted pair
GROUP  BY t1.usr, t2.usr;

Depending on how many overlaps you have, this potentially produces a HUGE number of rows, as you said. So it is never going to be fast.

That raises the question: what's the purpose of producing millions of rows that nobody can process? Are you sure the operation makes sense to begin with?

To make it faster, you could:

  • Upgrade! PostgreSQL 8.4 is rather outdated by now. In particular, PostgreSQL 9.2 had its focus on big data. You can expect much better performance for a job like this.
    And nobody should be running 8.4.0, if only for security reasons; you are also missing out on a lot of bug fixes. The current point release is 8.4.17. To quote the linked website:

We always recommend that all users run the latest available minor release for whatever major version is in use.

  • Use an integer surrogate key for users, so you deal with integers only in usr_grp. That makes the table and indexes smaller and processing faster. If the n:m table (usr_grp) has a much higher cardinality than the table usr, this should be faster, even though it means additional joins:

SELECT u1.usr  || '-' || u2.usr, count(*) AS ct
FROM   usr_grp t1
JOIN   usr_grp t2 USING (grp_id) 
JOIN   usr u1 ON t1.usr_id = u1.usr_id
JOIN   usr u2 ON t2.usr_id = u2.usr_id
WHERE  t2.usr_id > t1.usr_id
GROUP  BY u1.usr_id, u2.usr_id;
  • Create a multi-column index (if you don't have it yet).
    grp_id must come first. Why does this matter?

    CREATE INDEX usr_grp_gu_idx ON usr_grp(grp_id, usr_id);
  • Put a lot of RAM into your machine and increase the settings for work_mem and shared_buffers.

Test case

I took the numbers @OldCurmudgeon reported for his test case and created a comparable test case in PostgreSQL.

-> SQLfiddle demo.

~ 250 ms in this public test database.
The result is not ordered (no ORDER BY) since that wasn't specified.
Compare that to the 2.5 minutes reported in the answer below: a factor of roughly 600.

answered by Erwin Brandstetter

How about letting your file system do it?

For each entry, open a file named for the group ID and append the new user's name. You will end up with one file per group.

You now have - for example:

Group-21.txt
 jim
 john

Group-32.txt
 bob
 jim
 john

Now run through all the group files, generating every user-name pair in each (I would sort the names and run a standard combination process over them). For each pair, append a "1" to a file named for that pair.

You now have - for example:

User-jim-john.txt
 11

User-bob-jim.txt
 1

User-bob-john.txt
 1

You now have the pairs in the file names, and the counts (in unary, so all you really need is the file size in bytes) in the files.

Almost all of this could be done in parallel, although phase 1 must complete before phase 2 begins. To improve speed, add cores or buy a faster disk. There is no memory limit, just disk.
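
A single-threaded sketch of the two phases (plus the final size-based count) might look like this in Java. The memberships.txt input name and the groups/ and pairs/ directories are illustrative choices, not part of the answer:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FileSystemPairCounter {
    public static void main(String[] args) throws IOException {
        Path groupDir = Paths.get("groups");
        Path pairDir = Paths.get("pairs");
        Files.createDirectories(groupDir);
        Files.createDirectories(pairDir);

        // Phase 1: one file per group, holding the names of its members
        try (BufferedReader in = Files.newBufferedReader(Paths.get("memberships.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length < 2) continue;
                append(groupDir.resolve("Group-" + parts[1] + ".txt"), parts[0] + "\n");
            }
        }

        // Phase 2: for every group file, append one "1" per sorted user pair
        // to a file named for that pair
        try (DirectoryStream<Path> groups = Files.newDirectoryStream(groupDir)) {
            for (Path group : groups) {
                List<String> members = new ArrayList<>(Files.readAllLines(group, StandardCharsets.UTF_8));
                Collections.sort(members);
                for (int i = 0; i < members.size(); i++) {
                    for (int j = i + 1; j < members.size(); j++) {
                        append(pairDir.resolve("User-" + members.get(i) + "-" + members.get(j) + ".txt"), "1");
                    }
                }
            }
        }

        // Phase 3: the count is simply the file size in bytes (unary encoding)
        try (DirectoryStream<Path> pairs = Files.newDirectoryStream(pairDir)) {
            for (Path pair : pairs) {
                System.out.println(pair.getFileName() + " " + Files.size(pair));
            }
        }
    }

    private static void append(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}

A real implementation would buffer the appends and spread the per-group work across threads, as described above.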

Added: I've just run some simulation tests of this algorithm using just one thread.

1800 groups, 300 users and 15000 memberships all randomly generated took about 2.5 minutes. 900 groups, 150 users and 7500 memberships took 54 seconds.

answered by OldCurmudgeon