I have database with user subscriptions to topics. There is currently about 20 000 topics, 20 mln users and 200 mln subscriptions stored in SQL database. Because of its size, the database is partitioned by topics, so I can't get the info in one database query. There are couple of topics with 10 mln subscriptions, couple with 100 000 and others have hundreds or less.
When an event occurs, it usually matches couple of topics, so to inform users, I need to perform query like "give me all users subscribed to topics x, y, z and perform union of sets", so that one user gets the news once even if he subscribed both topics x and z.
The constraints are:
I thought about using set of bloom filters for every topic, but they constraints are the other way round: "user either not subscribed for sure or probably subscribed". I need something like "user subscribed for sure or probably not".
Lossy hash tables might be good idea, but I am not sure, if they can be as memory efficient as bloom filters and I am afraid, that it would be always the same user, that is missing the content in his topic.
Do you know any other data structures, that mey be good for solving this problem?
What if each user record had a BIT FIELD representing all of the topics.
TABLE Users(ID INT, UserName VARCHAR(16), Topics BINARY(8000))
A binary 8k would allow you to have 64000 topics. I would probably use multiple columns of BINARY(1024) each so I could add more topics easily.
Now when an event comes in that's tagged for topics 1, 10, 20, 30, 40. I have to search every User, but this can be parallelized and will always be N complexity where N is the number of total users.
SELECT ID
FROM Users (READPAST)
WHERE
SUBSTRING(Topics, 1 / 8, 1) & (1 * POWER(2, (1 % 8))) > 0
OR
SUBSTRING(Topics, 10 / 8, 1) & (1 * POWER(2, (10 % 8))) > 0
OR
SUBSTRING(Topics, 20 / 8, 1) & (1 * POWER(2, (20 % 8))) > 0
OR
SUBSTRING(Topics, 30 / 8, 1) & (1 * POWER(2, (30 % 8))) > 0
OR
SUBSTRING(Topics, 40 / 8, 1) & (1 * POWER(2, (40 % 8))) > 0
OPTION (MAXDOP = 64)
As I said in comments, a memory-based exact solution is certainly feasible.
But if you really want an approximate data structure, then what you're looking for a size-limited set (of users for each topic) with random eviction.
You also need to compute unions rapidly on the fly when queries arrive. There's no helpful pre-computation here. If topic sets tend to repeat, you can look at caching the frequently used unions.
All the usual methods of representing a set apply. Hash tables (both closed and open), trees, and skip lists (all containing user id keys; no values required) are most likely.
If you use a closed hash table with a good hash function, pseudo-random eviction happens automatically. On collision, just replace the previous value. The problem with closed hashes is always picking a good table size for the set you need to represent. Remember that to recover set elements, you'll have to traverse the whole open table including null entries, so starting with a huge table isn't a good idea; rather start with a small one and reorganize, growing by a factor each time so reorganization amortizes to constant time overhead per element stored.
With the other schemes, you can literally do pseudo-random eviction when the table gets too big. The easiest way to evict fairly is store the user id's an a table and have the size-limited set store indices. Evict by generating a random index into the table and removing that id before adding a new one.
It's also possible to evict fairly from a BST set representation by using an order statistic tree: store the number of descendants in each node. Then you can always find the n'th element in key sorted order, where n is pseudo-random, and evict it.
I know you were looking for the bitwise space efficiency of a Bloom filter, but guaranteeing no false positives seems to rule that out.
This might not be the solution you were looking for, but you could utilize ElasticSearch's terms filter and to have one document like this for each user:
{
"id": 12345,
"topics": ["Apache", "GitHub", "Programming"]
}
Terms filters directly responds to the query "which users subscribe to at least one of these topics" and ES is very smart on how to cache and re-utilize filters.
It wouldn't be a probabilistic data structure but would very efficiently solve this problem. You'd need to use scan api for serializing retrieving potentially large JSON responses. If necessary you can scale this solution to billions of users spread on multiple computers and have response times like 10 - 100 milliseconds. You could also find correlations between topics (significant terms aggregation) and use ES as an engine for further analysis.
Edit: I implemented searching and scan / sroll API usage in Python and got some interesting results. I did "users who subscribe to any three of these topics" queries with that 20m users and 200m subscriptions dataset, and in general the search itself finishes in 4 - 8 milliseconds. Queries return 350.000 - 750.000 users.
Problems arise from getting user ids out of ES, even with the scan/scroll API. On Core i5 I seems to get only 8200 users / second so it is less than 0.5 million / minute (with "_source": false
). The query itself looks like this:
{
"filtered": {
"filter": {
"terms": {
"topics": [ 123, 234, 345 ],
"execution": "plain",
"_cache": false
}
}
}
}
In production I would use "execution": "bool"
so that partial query results can be cached and re-utilized at other queries. I don't know what is the bottle-neck with getting results out, server's CPU usage is 50% and I run the client's python script at the same machine, utilizing elasticsearch.helpers.scan
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With