When playing trading card games, I frequently wonder what would be the most efficient data structure to deal with the following problem.
In such games, I face an opponent whose deck contains N cards (N ~ 30..60..100), each chosen from M possible card types (M ~ typically 1,000s..10,000s). Cards are generally not required to be unique, i.e. a deck can contain repeated card types. The contents of the opponent's deck are unknown before the game.
As the game starts and progresses, I slowly learn, card by card, which cards the opponent uses. There is a dataset that includes the full contents of K previously seen decks (K ~ typically 100,000s). I want to query this dataset with the progressively growing sample obtained during a game to produce a ranked list of decks the opponent may be using.
What would be the most efficient data structure to do such querying, given mentioned limits on reasonably modern hardware (i.e. several gigabytes of RAM available)?
known K decks:
d1 = [1, 4, 6, 3, 4]
d2 = [5, 3, 3, 9, 5]
d3 = [5, 10, 4, 10, 1]
d4 = [3, 7, 1, 8, 5]
on turn 1, I reveal that an opponent uses card #5; thus, my list of candidates is reduced to:
d2 = [5, 3, 3, 9, 5] - score 2
d3 = [5, 10, 4, 10, 1] - score 1
d4 = [3, 7, 1, 8, 5] - score 1
d2 is ranked higher than the rest in the results because it contains two copies of card #5, so it's the more likely match.
on turn 2, I reveal that an opponent uses card #1; list of candidates is reduced to:
d3 = [5, 10, 4, 10, 1]
d4 = [3, 7, 1, 8, 5]
The trivial solution is, of course, to store the K decks as arrays of N integers. Computing the match score of one deck against a query of p revealed cards then takes O(N*p) checks; each time we see a match, we just increase the score by 1. Checking all K known decks against a query of p cards therefore takes O(K*N*p), roughly 100000 * 100 * 100 operations in the worst case => 1e9. That's a lot of work.
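The brute-force scoring just described can be sketched as follows (the deck names and helper are illustrative, not part of any real API):

```python
from collections import Counter

def match_score(deck, revealed):
    # Sum the copy counts of every revealed card: a deck holding two copies
    # of a revealed card scores 2 for it, matching the ranking above.
    counts = Counter(deck)
    return sum(counts[card] for card in revealed)

decks = {
    "d1": [1, 4, 6, 3, 4],
    "d2": [5, 3, 3, 9, 5],
    "d3": [5, 10, 4, 10, 1],
    "d4": [3, 7, 1, 8, 5],
}

# After card #5 is revealed: d2 scores 2, d3 and d4 score 1, d1 drops out.
ranking = sorted(
    ((match_score(deck, [5]), name) for name, deck in decks.items()),
    reverse=True,
)
```

This is the O(K*N*p) baseline: every deck is rescored from scratch on every query.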
We can set up an index that, for every known card type, holds a list of pointers to the decks that card appears in. However, this doesn't solve the problem of intersecting all these lists (and they can be huge; some cards may appear in 90..95% of known decks). A lookup for p cards boils down to intersecting p lists of up to K deck pointers, computing intersection scores in the process. Roughly, that is O(K*p), but with a fairly large constant. It's still 1e7 operations in late stages.
However, if we use the fact that every next turn restricts our dataset further, we can reapply the filtering to whatever the previous query returned. This way, it's O(K) every turn => 1e5 operations.
Is there a way to perform better, ideally one that doesn't depend on the value of K?
There are two things you can do to speed this up. First, create an inverted index that tells you which decks contain each card. So in your example decks above:
d1 = [1, 4, 6, 3, 4]
d2 = [5, 3, 3, 9, 5]
d3 = [5, 10, 4, 10, 1]
d4 = [3, 7, 1, 8, 5]
Your index is:
1: d1, d3, d4
3: d1, d2, d4
4: d1(2), d3
5: d2(2), d3, d4
6: d1
7: d4
8: d4
9: d2
10: d3(2)
It should be clear that this index takes about the same amount of memory as the decks themselves. That is, rather than K decks of N cards each, you have up to M index entries that together hold the same K*N deck references.
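One way to build the inverted index sketched above, keeping copy counts (shown as "(2)" in the listing) alongside the deck references:

```python
from collections import Counter, defaultdict

def build_index(decks):
    # Map each card type to the decks containing it, with copy counts.
    index = defaultdict(Counter)
    for name, deck in decks.items():
        for card in deck:
            index[card][name] += 1
    return index

decks = {
    "d1": [1, 4, 6, 3, 4],
    "d2": [5, 3, 3, 9, 5],
    "d3": [5, 10, 4, 10, 1],
    "d4": [3, 7, 1, 8, 5],
}
index = build_index(decks)
# index[5] == Counter({"d2": 2, "d3": 1, "d4": 1})
```

Keeping the counts lets the later ranking step weight decks that run multiple copies of a revealed card.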
When the user turns over the first card, 5, you quickly look up 5 in your index and you get the candidate list [d2, d3, d4].
Here's the second optimization: you keep that list of candidates around. You're no longer interested in the rest of the decks; they have been eliminated from the list of candidates. When the next card, 1, is revealed, you look up 1 in your index and you get [d1, d3, d4]. You intersect that with the first list of candidates to produce [d3, d4].
In the worst possible case, you'd end up doing N intersections (one per card) of K items each (if the decks are all very similar). But in most cases the number of decks that a card is in will be much smaller than K, so your candidate list length will likely shrink very quickly.
Finally, if you store the deck references as hash maps then the intersection goes very quickly because you only have to look for items from the (usually small) existing candidate list in the large list of items for the next card turned over. Those lookups are O(1).
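The incremental hash-map intersection might look like this sketch, assuming an index that maps each card type to a dict of {deck id: copy count} (the `refine` name is made up for illustration):

```python
def refine(candidates, index, card):
    # Keep only candidate decks that also contain the newly revealed card,
    # accumulating each survivor's score. Decks missing the card are dropped,
    # so every step scans only the (usually shrinking) candidate set.
    containing = index.get(card, {})
    if candidates is None:                  # first revealed card
        return dict(containing)
    return {deck: score + containing[deck]
            for deck, score in candidates.items() if deck in containing}

index = {
    1: {"d1": 1, "d3": 1, "d4": 1},
    5: {"d2": 2, "d3": 1, "d4": 1},
}
candidates = refine(None, index, 5)         # turn 1: card #5 revealed
candidates = refine(candidates, index, 1)   # turn 2: card #1 revealed
# candidates == {"d3": 2, "d4": 2}; d2 was eliminated on turn 2
```

Because dict lookups are O(1) on average, each refinement costs roughly the size of the current candidate list, not the size of the index entry.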
This is the basic idea of how a search engine works. You have a list of words, each of which contains references to the documents the word appears in. You can very quickly narrow a list of documents from hundreds of millions to just a handful in short order.
Your idea with intersecting p lists of deck pointers is good, but you're missing some optimizations.
Sort each list by some criterion (e.g. deck index) and use binary search to advance through the lists (take the smallest deck id using a heap and advance it to match or exceed the current largest deck id). This way you get through them faster, especially if the intersection contains few decks.
Also store the previous intersection so that for the next move you only need to intersect 2 lists (previous result and the new card).
Finally you can simply ignore cards that are too popular and just check for them in the final result.
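A simplified two-list version of the sorted intersection from the first suggestion can be sketched like this: walk the shorter list and advance through the longer one with binary search, never moving backwards (the heap described above generalizes this to p lists at once):

```python
from bisect import bisect_left

def intersect_sorted(short, long):
    # Intersect two sorted lists of deck ids. Each binary search resumes
    # from the previous position, so the longer list is scanned at most once.
    if len(short) > len(long):
        short, long = long, short
    result, pos = [], 0
    for deck_id in short:
        pos = bisect_left(long, deck_id, pos)
        if pos == len(long):
            break
        if long[pos] == deck_id:
            result.append(deck_id)
            pos += 1
    return result

# intersect_sorted([2, 3, 4], [1, 3, 4, 5]) -> [3, 4]
```

When one list is much shorter than the other, this does O(len(short) * log(len(long))) work instead of scanning both lists in full.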
I would suggest you implement a solution like this and run some benchmarks. It should beat the plain O(K) scan in practice.