Design/Code Dispatcher for a Publish-Subscribe System

Tags:

data-structures

A friend of mine was asked this problem in an interview. I would like to discuss this problem here

What can be the efficient implementation for this problem ?

A simple idea which comes to me is normal memqueue , using Memcache machines to scale several requests, with a consumer job running which will write things from memcache to DB. and later on for the second part we can just run a sql query to find list of matching subscribers .

PROBLEM:-

Events get published to this system. Each event can be thought of as containing a fixed number (N) of string columns called C1, C2, … CN. Each event can thus be passed around as an array of Strings (C1 being the 0th element in the array, C2 the 1st and so on).

There are M subscribers – S1, … SM

Each subscriber registers a predicate that specifies what subset of the events it’s interested in. Each predicate can contain:

Equality clause on columns, for example: (C1 == “US”)
Conjunctions of such clauses, example: 
    (C1 == “IN”) && (C2 == “home.php”) 
    (C1 == “IN”) && (C2 == “search.php”) && (C3 == “nytimes.com”)

(In the above examples, C1 stands for the country code of an event and C2 stands for the web page of the site and C3 the referrer code.)

ie. – each predicate is a conjunction of some number of equality conditions. Note that the predicate does not necessarily have an equality clause for ALL columns (ie. – a predicate may not care about the value of some or all columns). (In the examples above: #a does not care about the columns C3, … CN).

We have to design and code a Dispatcher that can match incoming events to registered subscribers. The incoming event rate is in millions per second. The number of subscribers is in thousands. So this dispatcher has to be very efficient. In plain words:

When the system boots, all the subscribers register their predicates to the dispatcher
After this events start coming to the dispatcher
For each event, the dispatcher has to emit the id of the matching subscribers.

In terms of an interface specification, the following can be roughly spelt out (in Java):

Class Dispatcher {

    public Dispatcher(int N /* number of columns in each event – fixed up front */);

    public void registerSubscriber( String subscriberId /* assume no conflicts */,
                                    String predicate /* predicate for this subscriberid */);

    public List<String> findMatchingIds(String[] event /* assume each event has N Strings */);

}

Ie.: the dispatcher is constructed, then a bunch of registerSubscriber calls are made. After this we continuously invoke the method findMatchingIds() and the goal of this exercise is to make this function as efficient as possible.

204

asked Dec 24 '12 16:12

Peter

3 Answers

As Hanno Binder implied, the problem is clearly set up to allow pre-processing the subscriptions to obtain an efficient lookup structure. Hanno says the lookup should be a map

(N, K) -> set of subscribers who specified K in field N     
(N, "") -> set of subscribers who omitted a predicate for field N

When an event arrives, just look up all the applicable sets and find their intersection. A lookup failure returns the empty set. I'm only recapping Hanno's fine answer to point out that a hash table is O(1) and perhaps faster in this application than a tree. On the other hand, intersecting trees can be faster, O(S + log N) where S is the intersection size. So it depends on the nature of the sets.

Alternative

Here is my alternative lookup structure, again created only once during preprocessing. Begin by compiling a map

(N, K) -> unique token T (small integer)

There is also a distinguished token 0 that stands for "don't care."

Now every predicate can be thought of as a regular expression-like pattern with N tokens, either representing a specific event string key or "don't care."

We can now build a decision tree in advance. You can also think of this tree is a Deterministic Finite Automaton (DFA) for recognizing the patterns. Edges are labeled with tokens, including "don't care". A don't care edge is taken if no other edge matches. Accepting states contain the respective subscriber set.

Processing an event starts with converting the keys to a token pattern. If this fails due to a missing map entry, there are no subscribers. Otherwise feed the pattern to the DFA. If the DFA consumes the pattern without crashing, the final state contains the subscriber set. Return this.

For the example, we would have the map:

(1, "IN") -> 1
(2, "home.php") -> 2
(2, "search.php") -> 3
(3, "nytimes.com") -> 4

For N=4, the DFA would look like this:

o --1--> o --2--> o --0--> o --0--> o
          \
            -3--> o --4--> o --0--> o

Note that since there are no subscribers who don't care about e.g. C1, the starting state doesn't have a don't care transition. Any event without "IN" in C1 will cause a crash, and the null set will be properly returned.

With only thousands of subscribers, the size of this DFA ought to be reasonable.

Processing time here is of course O(N) and could be very fast in practice. For real speed, the preprocessing could generate and compile a nest of C switch statements. In this fashion you might actually get millions of events per second with a small number of processors.

You might even be able to coax a standard tool like the flex scanner generator to do most of the work for you.

148

answered Oct 19 '22 04:10

Gene

A solution that comes to my mind would be:

For each Cn we have a mapping from values to sets of subscribers for those subscribers who subscribed for a value of Cn. Additionally, for each Cn we have a set of subscribers who don't care for the value of Cn ('ANY').

When receiving an event, we look up all the subscribers with matching subscriptions for Cn and receive a set with 0 or more subscribers. To this set we add those subscribers from the 'ANY' set for this Cn.

We do this for every n <= N, yielding n sets of subscribers. The intersection of all n sets is the set of subscribers matching this event.

The mapping from Cn to subscribers can efficiently be stored as a tree, which gives a complexity O(k) = log(k) to look up the subscribers for a single Cn, given that there are subscriptions to k different values.

Thus, for n values we have a complexity of O(n,k) = n * log(k).

Intersecting n sets can also be done in O(n,m) = n * log(m), so that we end up with a logarithmic complexity in total, which shouldn't be too bad.

answered Oct 19 '22 04:10

JimmyB

Interesting.

My initial thoughts. I feel it would be easier if the subscriber predicates for e.g.

(C1 == “IN”) && (C2 == “search.php”) && (C3 == “nytimes.com”)

that come to the Dispatcher

public void registerSubscriber

method needs to be flattened so that it is much performance friendly for comparison. Something like below (wild guess)

C1IN|C2search.php|C3nytimes.com

Then a map needs to be maintained in the memory with event string and subscriber ids

In the

findMatchingIds

method - the String array of events also need to be flattened with the similar rules so that a look up can be done for the matching subscriber id

This way the Dispatchers can be scaled horizontally serving many events in parallel

answered Oct 19 '22 03:10

basav

Related questions
                            
                                Looking for a good introduction on trie [closed]
                            
                                print boundary of binary tree
                            
                                How can I modify this line of sight algorithm to accept rays which pass through corners?
                            
                                What is the complexity (Big-O) of this algorithm?
                            
                                Algorithm analysis: Am I analyzing these algorithms correctly? How to approach problems like these [closed]
                            
                                Algorithm for finding dead pixels in Javascript
                            
                                check if point exists in QPainterPath
                            
                                number of binary heaps from 1..n
                            
                                How to create an outline of an image if I know the background color?
                            
                                Rtrees - basics of the algorithm
                            
                                How to check if a circular single linked list is pallindrome or not?
                            
                                Relationship between BFS and topological sort
                            
                                Combinatorial algorithm for assigning people to groups
                            
                                What type of smoothing to use?
                            
                                Calculate all possible combinations
                            
                                Distributed rate limiting algorithm [closed]
                            
                                How to traverse a graph in parallel with 2 or n processes
                            
                                Eliminating rounding errors from matrix
                            
                                sorting algorithm to keep sort position numbers updated
                            
                                Android : Get the maximum value of light sensor

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With