Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Design/Code Dispatcher for a Publish-Subscribe System

A friend of mine was asked this problem in an interview. I would like to discuss this problem here

What can be the efficient implementation for this problem ?

A simple idea which comes to me is normal memqueue , using Memcache machines to scale several requests, with a consumer job running which will write things from memcache to DB. and later on for the second part we can just run a sql query to find list of matching subscribers .

PROBLEM:-

Events get published to this system. Each event can be thought of as containing a fixed number (N) of string columns called C1, C2, … CN. Each event can thus be passed around as an array of Strings (C1 being the 0th element in the array, C2 the 1st and so on).

There are M subscribers – S1, … SM

Each subscriber registers a predicate that specifies what subset of the events it’s interested in. Each predicate can contain:

Equality clause on columns, for example: (C1 == “US”)
Conjunctions of such clauses, example: 
    (C1 == “IN”) && (C2 == “home.php”) 
    (C1 == “IN”) && (C2 == “search.php”) && (C3 == “nytimes.com”)

(In the above examples, C1 stands for the country code of an event and C2 stands for the web page of the site and C3 the referrer code.)

ie. – each predicate is a conjunction of some number of equality conditions. Note that the predicate does not necessarily have an equality clause for ALL columns (ie. – a predicate may not care about the value of some or all columns). (In the examples above: #a does not care about the columns C3, … CN).

We have to design and code a Dispatcher that can match incoming events to registered subscribers. The incoming event rate is in millions per second. The number of subscribers is in thousands. So this dispatcher has to be very efficient. In plain words:

When the system boots, all the subscribers register their predicates to the dispatcher
After this events start coming to the dispatcher
For each event, the dispatcher has to emit the id of the matching subscribers.

In terms of an interface specification, the following can be roughly spelt out (in Java):

Class Dispatcher {

    public Dispatcher(int N /* number of columns in each event – fixed up front */);

    public void registerSubscriber( String subscriberId /* assume no conflicts */,
                                    String predicate /* predicate for this subscriberid */);

    public List<String> findMatchingIds(String[] event /* assume each event has N Strings */);

}

Ie.: the dispatcher is constructed, then a bunch of registerSubscriber calls are made. After this we continuously invoke the method findMatchingIds() and the goal of this exercise is to make this function as efficient as possible.

like image 204
Peter Avatar asked Dec 24 '12 16:12

Peter


People also ask

What design pattern best describes the publish subscribe messaging?

The Publish/Subscribe pattern, also known as pub/sub, is an architectural design pattern that provides a framework for exchanging messages between publishers and subscribers. This pattern involves the publisher and the subscriber relying on a message broker that relays messages from the publisher to the subscribers.

What is publish subscribe system in distributed system?

What is a Publish/Subscribe System? Distributed Pub/Sub System is a communication paradigm that allows freedom in the distributed system by the decoupling of communication entities in terms of time, space and synchronization. An event service system that is asynchronous, anonymous and loosely-coupled.

What is pub/sub design?

Publish/subscribe messaging, or pub/sub messaging, is a form of asynchronous service-to-service communication used in serverless and microservices architectures. In a pub/sub model, any message published to a topic is immediately received by all of the subscribers to the topic.

Is publish subscribe a middleware?

Publish–subscribe is a sibling of the message queue paradigm, and is typically one part of a larger message-oriented middleware system. Most messaging systems support both the pub/sub and message queue models in their API; e.g., Java Message Service (JMS).


3 Answers

As Hanno Binder implied, the problem is clearly set up to allow pre-processing the subscriptions to obtain an efficient lookup structure. Hanno says the lookup should be a map

(N, K) -> set of subscribers who specified K in field N     
(N, "") -> set of subscribers who omitted a predicate for field N

When an event arrives, just look up all the applicable sets and find their intersection. A lookup failure returns the empty set. I'm only recapping Hanno's fine answer to point out that a hash table is O(1) and perhaps faster in this application than a tree. On the other hand, intersecting trees can be faster, O(S + log N) where S is the intersection size. So it depends on the nature of the sets.

Alternative

Here is my alternative lookup structure, again created only once during preprocessing. Begin by compiling a map

(N, K) -> unique token T (small integer)

There is also a distinguished token 0 that stands for "don't care."

Now every predicate can be thought of as a regular expression-like pattern with N tokens, either representing a specific event string key or "don't care."

We can now build a decision tree in advance. You can also think of this tree is a Deterministic Finite Automaton (DFA) for recognizing the patterns. Edges are labeled with tokens, including "don't care". A don't care edge is taken if no other edge matches. Accepting states contain the respective subscriber set.

Processing an event starts with converting the keys to a token pattern. If this fails due to a missing map entry, there are no subscribers. Otherwise feed the pattern to the DFA. If the DFA consumes the pattern without crashing, the final state contains the subscriber set. Return this.

For the example, we would have the map:

(1, "IN") -> 1
(2, "home.php") -> 2
(2, "search.php") -> 3
(3, "nytimes.com") -> 4

For N=4, the DFA would look like this:

o --1--> o --2--> o --0--> o --0--> o
          \
            -3--> o --4--> o --0--> o

Note that since there are no subscribers who don't care about e.g. C1, the starting state doesn't have a don't care transition. Any event without "IN" in C1 will cause a crash, and the null set will be properly returned.

With only thousands of subscribers, the size of this DFA ought to be reasonable.

Processing time here is of course O(N) and could be very fast in practice. For real speed, the preprocessing could generate and compile a nest of C switch statements. In this fashion you might actually get millions of events per second with a small number of processors.

You might even be able to coax a standard tool like the flex scanner generator to do most of the work for you.

like image 148
Gene Avatar answered Oct 19 '22 04:10

Gene


A solution that comes to my mind would be:

For each Cn we have a mapping from values to sets of subscribers for those subscribers who subscribed for a value of Cn. Additionally, for each Cn we have a set of subscribers who don't care for the value of Cn ('ANY').

When receiving an event, we look up all the subscribers with matching subscriptions for Cn and receive a set with 0 or more subscribers. To this set we add those subscribers from the 'ANY' set for this Cn.

We do this for every n <= N, yielding n sets of subscribers. The intersection of all n sets is the set of subscribers matching this event.

The mapping from Cn to subscribers can efficiently be stored as a tree, which gives a complexity O(k) = log(k) to look up the subscribers for a single Cn, given that there are subscriptions to k different values.

Thus, for n values we have a complexity of O(n,k) = n * log(k).

Intersecting n sets can also be done in O(n,m) = n * log(m), so that we end up with a logarithmic complexity in total, which shouldn't be too bad.

like image 30
JimmyB Avatar answered Oct 19 '22 04:10

JimmyB


Interesting.

My initial thoughts. I feel it would be easier if the subscriber predicates for e.g.

(C1 == “IN”) && (C2 == “search.php”) && (C3 == “nytimes.com”)

that come to the Dispatcher

public void registerSubscriber

method needs to be flattened so that it is much performance friendly for comparison. Something like below (wild guess)

C1IN|C2search.php|C3nytimes.com

Then a map needs to be maintained in the memory with event string and subscriber ids

In the

findMatchingIds

method - the String array of events also need to be flattened with the similar rules so that a look up can be done for the matching subscriber id

This way the Dispatchers can be scaled horizontally serving many events in parallel

like image 27
basav Avatar answered Oct 19 '22 03:10

basav