My question has some similarities to this question: Why do we need message brokers like RabbitMQ over a database like PostgreSQL?
In my current (semi-professional) project I'm also at the point of deciding whether to go for a database, a message broker-based solution (e.g. with RabbitMQ), or even a totally different solution.
Let's imagine 2 tools, Tool A and Tool B. Whenever Tool A has run and finished, there might be something to do for Tool B. Execution of Tool A takes quite some time (> 60 sec) and often there will be nothing to do for Tool B. Tool A provides some metadata for Tool B so Tool B knows what to do.
Message-based solution: Establish a message queue which Tool B is consuming. In case Tool A was executed and Tool B should run, Tool A publishes a message (including the metadata) to the queue which Tool B receives so Tool B will run using the metadata from the message.
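To make the message-based idea concrete, here is a minimal sketch that uses Python's in-process `queue.Queue` as a stand-in for a real broker queue. The function names, the `needs_b` flag, the message shape, and the `None` sentinel are all assumptions made for illustration; with RabbitMQ the publish and consume calls would go through a client library instead.

```python
import queue
import threading


def tool_a(outbox, work_items):
    """Hypothetical Tool A: publishes a message only when Tool B has work to do."""
    for item in work_items:
        # ... the long-running work of Tool A would happen here ...
        if item["needs_b"]:
            outbox.put({"metadata": item["metadata"]})
    outbox.put(None)  # sentinel (an assumption of this sketch): no more messages


def tool_b(inbox, results):
    """Hypothetical Tool B: blocks on the queue instead of polling."""
    while True:
        message = inbox.get()  # blocks until a message arrives
        if message is None:
            break
        results.append("processed " + message["metadata"])


def run_pipeline(work_items):
    """Wire both tools to one queue and return what Tool B processed."""
    q = queue.Queue()
    results = []
    consumer = threading.Thread(target=tool_b, args=(q, results))
    consumer.start()
    tool_a(q, work_items)
    consumer.join()
    return results
```

Tool B never asks "is there anything yet?"; it simply waits in `get()` until Tool A publishes something.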
Database solution: Whenever Tool A is running it adds a database record with e.g. timestamp, the metadata and status "RUNNING". In case Tool A was executed and Tool B should run, it updates the DB record status to "NEXT_TOOL_B". Tool B is constantly querying the DB for records with "NEXT_TOOL_B" status. In case it finds something, Tool B will run using the metadata from the DB records.
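The database variant could look roughly like the sketch below, which uses an in-memory SQLite database so it runs without any setup. The table layout, the function names, and the final "DONE" status are assumptions; only "RUNNING" and "NEXT_TOOL_B" come from the description above.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical jobs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        "id INTEGER PRIMARY KEY, "
        "started_at TEXT DEFAULT CURRENT_TIMESTAMP, "
        "metadata TEXT, "
        "status TEXT)"
    )
    return conn


def tool_a_start(conn, metadata):
    """Tool A records that it is running and returns the job id."""
    cur = conn.execute(
        "INSERT INTO jobs (metadata, status) VALUES (?, 'RUNNING')", (metadata,)
    )
    conn.commit()
    return cur.lastrowid


def tool_a_finish(conn, job_id, needs_b):
    """Tool A hands the job to Tool B, or closes it if there is nothing to do."""
    status = "NEXT_TOOL_B" if needs_b else "DONE"  # "DONE" is an assumed status
    conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))
    conn.commit()


def tool_b_poll_once(conn):
    """One polling pass of Tool B: claim every waiting job, return its metadata."""
    rows = conn.execute(
        "SELECT id, metadata FROM jobs WHERE status = 'NEXT_TOOL_B' ORDER BY id"
    ).fetchall()
    for job_id, _ in rows:
        conn.execute("UPDATE jobs SET status = 'DONE' WHERE id = ?", (job_id,))
    conn.commit()
    return [metadata for _, metadata in rows]


def current_status(conn, job_id):
    """What a monitoring tool such as Tool C would query at any time."""
    row = conn.execute("SELECT status FROM jobs WHERE id = ?", (job_id,)).fetchone()
    return row[0] if row else None
```

In a real deployment `tool_b_poll_once` would be called in a loop with a sleep in between, which is exactly the polling cost mentioned below.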
While I'm aware of the disadvantages of the database solution e.g. the constant polling from Tool B, I miss one feature of it in the message-based solution:
Whenever a 3rd tool, say Tool C (e.g. a control panel UI), wants to know the current status, it can query the DB at any time and will find a "RUNNING" status if Tool A is still at work. In the message solution, I don't really see a way to "monitor" the status until the finish message is on the queue.
So my question is: can you think of a way to achieve this using messages, or any other method that gets along without polling?
A message broker is an architectural pattern for message validation, transformation, and routing. It mediates communication among applications, minimizing the mutual awareness that applications should have of each other in order to be able to exchange messages, effectively implementing decoupling.
Another powerful pattern is for your database to directly publish messages to RabbitMQ. This can be achieved by using extensions or plugins in the database, or by having a RabbitMQ plugin that acts as a database client, publishing messages whenever database events occur.
Firstly, let's be clear, the terms Message Broker and Message Bus are used in architectural patterns for messaging systems, also referred to as messaging topologies. Whilst a Message Bus is one such topology, a Message Broker is only one component in an alternative topology known as Hub and Spoke.
Examples of message brokers
The most popular message brokers are RabbitMQ, Apache Kafka, Redis, Amazon SQS, and Amazon SNS. Each of them is a great and powerful tool to use.
The scenario described in the question is that of a system composed of multiple pieces that work together to achieve a function. In this case, you have three different processes {A, B, C}, together with a database and an optional message queue. All systems, as part of their purpose, accept one or more inputs, execute some process, and produce one or more outputs. In your case, one of your desired outputs is the state of the system and its processing, which is not an altogether unreasonable thing to want.
Queue or Database?
Now, down to your question. Why use a message queue instead of a database? Both are similar components of a system in that they provide some storage capacity. You might well ask the same question in a refrigerator manufacturing plant: when does it make more sense to use a shelf on the assembly line as opposed to a warehouse?
Databases are like warehouses - they are designed to hold a lot of different things and keep them all relatively straight. A good warehouse allows users to find things in the warehouse quickly, and avoids losing parts and materials. If it goes in, it can easily come back out, but not instantly.
Message queues, on the other hand, are like the shelves located near the operator stations in an assembly line. Parts accumulate there from the previous operation waiting to be consumed by the person running the station. The shelves are designed to hold a small number of the same thing - just like a message queue in a software system. They are close to the worker, so when the next part is ready to be worked, it can be retrieved very quickly (as opposed to a trip to the warehouse, which can take several minutes or more). In addition, the worker has immediate visibility to what's on the shelf - if the shelf is empty, the worker might take a break and wait for it to accumulate a part or two again.
Finally, if one part of the factory grossly over-produces (we don't like it when this happens, because it indicates waste), then the shelves are going to be overwhelmed, and the overage is going to need to be put into the warehouse. Believe it or not, this happens all the time in factories - sometimes stations go down for brief periods of time and the warehouse acts as a longer-term buffer.
When to use one or the other?
So, back to the question. You use a message queue when you expect that your production of messages will usually match the consumption of messages, and you need speed in retrieval. You don't expect things to stay around in the queue very long. Software queue systems, such as RabbitMQ, also perform some very specific functions, like ensuring that a job gets handled by only one processor, and that it can be picked up by a different processor if the first goes down.
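The "only one processor, with takeover on failure" behaviour can be illustrated with a toy in-memory queue. This is a sketch of the general acknowledge-and-requeue idea, not RabbitMQ's actual API; the class and method names are hypothetical.

```python
from collections import deque


class TinyWorkQueue:
    """Toy work queue: each message goes to one consumer until it is acknowledged."""

    def __init__(self):
        self._ready = deque()  # messages waiting for a consumer
        self._unacked = {}     # delivery tag -> (consumer name, message)
        self._next_tag = 1

    def publish(self, message):
        self._ready.append(message)

    def get(self, consumer):
        """Hand the next message to exactly one consumer, or return None if empty."""
        if not self._ready:
            return None
        message = self._ready.popleft()
        tag = self._next_tag
        self._next_tag += 1
        self._unacked[tag] = (consumer, message)
        return tag, message

    def ack(self, tag):
        """The consumer finished the job, so the message is gone for good."""
        del self._unacked[tag]

    def consumer_died(self, consumer):
        """Put the dead consumer's unfinished messages back at the front."""
        tags = [t for t, (c, _) in self._unacked.items() if c == consumer]
        for tag in sorted(tags, reverse=True):  # reverse keeps the original order
            _, message = self._unacked.pop(tag)
            self._ready.appendleft(message)
```

While a worker holds a message, no other worker can receive it; if that worker disappears before acknowledging, the message becomes available again.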
On the other hand, you would use a database for things which require the persistence of state across multiple processing steps. Your job status is a perfect example of something that should be stored in the database. To continue the factory analogy - think of that as a report that gets sent back to the production planner when each step is completed. The production planner is going to keep it in a database.
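Applied to the three tools from the question, one possible arrangement is to keep the status in the database and use the queue only for the handoff. The sketch below uses SQLite and `queue.Queue` as stand-ins; the table, the function names, and the "DONE" and "UNKNOWN" values are assumptions made for illustration.

```python
import queue
import sqlite3


def make_system():
    """Create the status store (database) and the handoff channel (queue)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE job_status (job_id TEXT PRIMARY KEY, status TEXT)")
    return db, queue.Queue()


def set_status(db, job_id, status):
    db.execute(
        "INSERT OR REPLACE INTO job_status (job_id, status) VALUES (?, ?)",
        (job_id, status),
    )
    db.commit()


def tool_a_start(db, job_id):
    """Tool A reports that it has started; Tool C can see this immediately."""
    set_status(db, job_id, "RUNNING")


def tool_a_finish(db, q, job_id, metadata, needs_b):
    """Tool A updates the status and, only if needed, notifies Tool B."""
    if needs_b:
        set_status(db, job_id, "NEXT_TOOL_B")
        q.put({"job_id": job_id, "metadata": metadata})
    else:
        set_status(db, job_id, "DONE")  # assumed terminal status


def tool_b_handle_one(db, q):
    """Tool B blocks on the queue (no polling) and reports back when done."""
    message = q.get()
    # ... Tool B's real work on message["metadata"] would happen here ...
    set_status(db, message["job_id"], "DONE")
    return message["metadata"]


def tool_c_status(db, job_id):
    """Tool C reads the status on demand, for example when the UI refreshes."""
    row = db.execute(
        "SELECT status FROM job_status WHERE job_id = ?", (job_id,)
    ).fetchone()
    return row[0] if row else "UNKNOWN"
```

Tool B no longer polls, and Tool C only queries the database when someone actually looks at the control panel.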
You would also want to use a database when there is a likelihood that the queue will get full, or when it's critical that data not get lost between one job step and another. For example, a manufacturing plant will often store its finished products in the warehouse pending shipment to the customer. Use a database for all long-term (more than a few seconds) storage needs in your application.
Bottom Line
Most scalable software systems will have a need for both queues and databases, and the key is knowing when to use each.
Hopefully this makes some degree of sense.