I need to write a program that counts the number of times two users are in the same group. The users are given by username and the groups by ID. For example, with this input (stored in a text file):
john 32
john 21
jim 21
jim 32
bob 32
I want the result:
john-jim 2
john-bob 1
jim-bob 1
This sounds trivial. But the problem is: I have 1.8 million groups and 300,000 users, and a lot of memberships (I'm expecting an average of at least 50 per user, possibly much more). That means a HUGE amount of data and processing.
I've written five different programs to do this, and none of them could cut down the amount of data: it was too slow as a PostgreSQL query; too memory-consuming when counting in a Java Map in working memory (first heap-space errors, then after some optimization the rarer "GC overhead limit exceeded"); too slow when writing continuously to the database from Java (even when optimized with batch queries). Growing increasingly desperate, I've tried some more exotic things, like writing all the pairs to an array, sorting them (O(n log n)) and then counting them bit by bit. But it was still far too much data to hold in memory.
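To make the memory problem concrete, here is a simplified sketch of the kind of in-memory counting I mean (not my actual code; the input file name is made up):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

// Simplified sketch: collect users per group, then count every pair in memory.
public class PairCounter {
    public static void main(String[] args) throws IOException {
        // Collect the members of every group.
        Map<String, List<String>> usersPerGroup = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("memberships.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");   // "username groupId"
                usersPerGroup.computeIfAbsent(parts[1], k -> new ArrayList<>()).add(parts[0]);
            }
        }
        // Count every pair of users that shares a group.
        // The number of distinct pairs is what blows up the heap.
        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> members : usersPerGroup.values()) {
            Collections.sort(members);                         // so "a-b" and "b-a" collapse
            for (int i = 0; i < members.size(); i++) {
                for (int j = i + 1; j < members.size(); j++) {
                    pairCounts.merge(members.get(i) + "-" + members.get(j), 1, Integer::sum);
                }
            }
        }
        pairCounts.forEach((pair, ct) -> System.out.println(pair + " " + ct));
    }
}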
Any ideas on an algorithm for doing this? Or is it impossible?
An RDBMS specializes in operations like sorting. Doing this outside the DB will hardly ever come close in performance. Do it with SQL!
This would do the job (simplified in update):
SELECT t1.usr || '-' || t2.usr, count(*) AS ct
FROM usr_grp t1
JOIN usr_grp t2 USING (grp_id)
WHERE t2.usr > t1.usr -- prevent dupes and get sorted pair
GROUP BY t1.usr, t2.usr;
Depending on how many overlaps you have, this potentially produces a HUGE number of rows, as you said. So it is never going to be fast.
That raises the question: what's the purpose of producing millions of rows that nobody can process? Are you sure the operation makes sense to begin with?
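For perspective: 300,000 users give up to 300,000 × 299,999 / 2 ≈ 45 billion distinct pairs, and every pair that shares at least one group becomes one result row.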
To make it faster, you could ...

- Run a current version of Postgres (the project recommends always running the latest available minor release of whatever major version is in use).

- Use an integer as surrogate key for users, so you deal with integers only in usr_grp. That makes the table and its indexes smaller and processing faster. If the n:m table (usr_grp) has a much bigger cardinality than the table usr, this should be faster, even if it means additional joins:

SELECT u1.usr || '-' || u2.usr, count(*) AS ct
FROM usr_grp t1
JOIN usr_grp t2 USING (grp_id)
JOIN usr u1 ON t1.usr_id = u1.usr_id
JOIN usr u2 ON t2.usr_id = u2.usr_id
WHERE t2.usr_id > t1.usr_id
GROUP BY u1.usr_id, u2.usr_id;

- Create a multicolumn index on usr_grp, with grp_id first (column order matters in a multicolumn index):

CREATE INDEX usr_grp_gu_idx ON usr_grp(grp_id, usr_id);

- Put plenty of RAM into the machine and increase the settings for work_mem and shared_buffers.

I took the numbers @OldCurmudgeon reported for his test case and created a comparable test case in PostgreSQL.
-> SQLfiddle demo.
~ 250 ms in this public test database.
The result is not ordered (no ORDER BY) since that hasn't been specified.
Compared to the 2.5 minutes reported below: a factor of 600.
How about letting your file system do it?
For each entry, open a file named for the group ID and append the new user's name. You will end up with one file per group.
You now have - for example:
Group-21.txt
jim
john
Group-32.txt
bob
jim
john
Now run through all the files, generating every user-name pair found in each one (I would sort the names and run a standard combination process over them). For each pair, append a "1" to a file named for that pair.
You now have - for example:
User-jim-john.txt
11
User-bob-jim.txt
1
User-bob-john.txt
1
You now have the pairs in the file names and the counts in the files (in unary, so all you really need is the file size in bytes).
Almost all of this could be done in parallel, although phase 1 must complete before phase 2 begins. To improve speed, add cores or buy a faster disk. There is no memory limit, just disk.
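Here is a minimal single-threaded sketch of the two phases in Java (assuming the "username groupId" input format from the question; the directory and file names are made up):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Sketch of the two-phase, file-based counting described above.
public class GroupPairFiles {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createDirectories(Paths.get("pairs-work"));

        // Phase 1: append each user name to a file named for its group.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("memberships.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");   // "username groupId"
                Files.write(dir.resolve("Group-" + parts[1] + ".txt"),
                            (parts[0] + System.lineSeparator()).getBytes(),
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }

        // Phase 2: for each group file, sort the members and append one byte
        // per co-occurrence to a file named for the pair (unary counting).
        try (DirectoryStream<Path> groups = Files.newDirectoryStream(dir, "Group-*.txt")) {
            for (Path group : groups) {
                List<String> members = Files.readAllLines(group);
                Collections.sort(members);                     // pair is always "a-b", never "b-a"
                for (int i = 0; i < members.size(); i++) {
                    for (int j = i + 1; j < members.size(); j++) {
                        Path pairFile = dir.resolve(
                            "User-" + members.get(i) + "-" + members.get(j) + ".txt");
                        Files.write(pairFile, new byte[]{'1'},
                                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    }
                }
            }
        }

        // The count for each pair is simply the size of its file in bytes.
        try (DirectoryStream<Path> pairs = Files.newDirectoryStream(dir, "User-*.txt")) {
            for (Path pair : pairs) {
                String name = pair.getFileName().toString();
                System.out.println(name.substring(5, name.length() - 4) + " " + Files.size(pair));
            }
        }
    }
}

Using the file size as the count means the pair counts never have to fit in memory; the operating system's page cache does the buffering.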
Added: I've just run some simulation tests of this algorithm using just one thread:
1800 groups, 300 users and 15000 memberships all randomly generated took about 2.5 minutes. 900 groups, 150 users and 7500 memberships took 54 seconds.