Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster way to count number of sets an item appears in?

I've got a list of bookmarks. Each bookmark has a list of keywords (stored as a HashSet). I also have a set of all possible keywords ("universe").

I want to find the keyword that appears in the most bookmarks.

I have 1356 bookmarks with a combined total of 698,539 keywords, with 187,358 unique.

If I iterate through every keyword in the universe and count the number of bookmarks it appears in, I'm doing 254,057,448 checks. This takes 35 seconds on my machine.

The algorithm is pretty simple:

var biggest = universe.MaxBy(kw => bookmarks.Count(bm => bm.Keywords.Contains(kw)));

Using Jon Skeet's MaxBy.

I'm not sure it's possible to speed this up much, but is there anything I can do? Perhaps parallelize it somehow?


dtb's solution takes under 200 ms to both build the universe and find the biggest element. So simple.

var freq = new FreqDict();
foreach(var bm in bookmarks) {
    freq.Add(bm.Keywords);
}
var biggest2 = freq.MaxBy(kvp => kvp.Value);

FreqDict is just a little class I made built on top of a Dictionary<string,int>.

like image 395
mpen Avatar asked Jan 15 '23 20:01

mpen


1 Answers

You can get all keywords, group them, and get the biggest group. This uses more memory, but should be faster.

I tried this, and in my test it was about 80 times faster:

string biggest =
  bookmarks
  .SelectMany(m => m.Keywords)
  .GroupBy(k => k)
  .OrderByDescending(g => g.Count())
  .First()
  .Key;

Test run:

1536 bookmarks
153600 keywords
74245 unique keywords

Original:
12098 ms.
biggest = "18541"

New:
148 ms.
biggest = "18541"
like image 65
Guffa Avatar answered Jan 23 '23 16:01

Guffa