Apache Apex looks similar to Apache Storm.
At a glance, both look similar and I don't quite see the difference. Can someone please explain the key differences? In other words, when should I use one instead of the other?
Apache Storm and Spark are platforms for big data processing that work with real-time data streams. The core difference between the two technologies is in the way they handle data processing. Storm parallelizes task computation while Spark parallelizes data computations.
Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use!
We currently use Storm as our Twitter realtime data processing pipeline. We have Storm topologies for content filtering, geolocalisation and classification.
Apache Flink can run stateful streaming applications at any scale and execute both batch and stream processing without a fuss. With Flink, you can ingest streaming data from many sources, process it, and distribute it across various nodes.
There are fundamental differences in architecture that make the two platforms very different in terms of latency, scaling and state management.
At the very basic level,
You can read about more differences in the following blog post, which also covers the other main stream processing platforms out there.
https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/
Architecture and Features
+-------------------+---------------------------+---------------------+
|                   | Storm                     | Apex                |
+-------------------+---------------------------+---------------------+
| Model             | Native Streaming          | Native Streaming    |
|                   | Micro-batch (Trident)     |                     |
+-------------------+---------------------------+---------------------+
| Language          | Java                      | Java (Scala)        |
|                   | Non-JVM languages         |                     |
|                   | also supported            |                     |
+-------------------+---------------------------+---------------------+
| API               | Compositional             | Compositional (DAG) |
|                   | Declarative (Trident)     | Declarative         |
|                   | Limited SQL               |                     |
|                   | support (Trident)         |                     |
+-------------------+---------------------------+---------------------+
| Locality          | Data Locality             | Advanced Processing |
+-------------------+---------------------------+---------------------+
| Latency           | Low                       | Very Low            |
|                   | High (Trident)            |                     |
+-------------------+---------------------------+---------------------+
| Throughput        | Limited in Ack mode       | Very high           |
+-------------------+---------------------------+---------------------+
| Scalability       | Limited due to Ack        | Horizontal          |
+-------------------+---------------------------+---------------------+
| Partitioning      | Standard                  | Advanced            |
|                   | Parallelism at worker,    | Parallel pipes,     |
|                   | executor and task level   | unifiers            |
+-------------------+---------------------------+---------------------+
| Connector Library | Limited (certification)   | Rich library of     |
|                   |                           | connectors in       |
|                   |                           | Apex Malhar         |
+-------------------+---------------------------+---------------------+
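To make the "parallel pipes, unifiers" entry in the Partitioning row more concrete, here is a minimal plain-Python sketch of the general idea: a stream is split across several parallel partitions, each partition computes a partial result, and a unifier step merges the partials into one answer. This is only an illustration under my own assumptions. The function names (`partition_round_robin`, `count_partial`, `unify`) are hypothetical and are not part of the Apex or Storm APIs.

```python
from collections import Counter


def partition_round_robin(events, num_partitions):
    """Spread events across partitions in round-robin order (hypothetical helper)."""
    parts = [[] for _ in range(num_partitions)]
    for i, event in enumerate(events):
        parts[i % num_partitions].append(event)
    return parts


def count_partial(part):
    """Work done independently inside one partition: count the words it received."""
    return Counter(part)


def unify(partials):
    """Unifier step: merge the partial counts from all partitions into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


events = ["a", "b", "a", "c", "b", "a"]
partials = [count_partial(p) for p in partition_round_robin(events, 3)]
totals = unify(partials)
```

Because round-robin partitioning can send the same word to different partitions, no single partition holds the full count, which is why a merge step is needed at the end.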
Operability
+------------+--------------------------+---------------------+
|            | Storm                    | Apex                |
+------------+--------------------------+---------------------+
| State      | External store           | Checkpointing       |
| Management | Limited checkpointing    | Local checkpointing |
|            | Difficult to exploit     |                     |
|            | local state              |                     |
+------------+--------------------------+---------------------+
| Recovery   | Cumbersome API to        | Incremental         |
|            | store and retrieve state | (buffer server)     |
|            | Requires user code       |                     |
+------------+--------------------------+---------------------+
| Processing | At least once            | At least once       |
| Semantics  | Exactly once requires    | End-to-end          |
|            | user code and affects    | exactly once        |
|            | latency                  |                     |
+------------+--------------------------+---------------------+
| Back       | Watermark on queue       | Automatic           |
| Pressure   | size for spout and bolt  | Buffer server       |
|            | Does not scale           | memory and disk     |
+------------+--------------------------+---------------------+
| Elasticity | Through CLI only         | Yes, with full user |
|            |                          | control             |
+------------+--------------------------+---------------------+
| Dynamic    | No                       | Yes                 |
| Topology   |                          |                     |
+------------+--------------------------+---------------------+
| Security   | Kerberos                 | Kerberos, RBAC,     |
|            |                          | LDAP                |
+------------+--------------------------+---------------------+
| Multi      | Mesos, RAS - memory,     | YARN                |
| Tenancy    | CPU, YARN                | full isolation      |
+------------+--------------------------+---------------------+
| DevOps     | REST API                 | REST API            |
| Tools      | Basic UI                 | DataTorrent RTS     |
+------------+--------------------------+---------------------+
Source: Webinar: Apache Apex (Next Gen Hadoop) vs. Storm - Comparison and Migration Outline https://www.youtube.com/watch?v=sPjyo2HfD_I
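The "Processing Semantics" row says that with at-least-once delivery, exactly-once results require user code. A rough way to see why is the toy simulation below, written in plain Python rather than against either framework. It assumes a source that replays a message whenever its acknowledgement is lost, and compares a naive sink with one that deduplicates by message ID. All names here (`deliver_at_least_once`, `NaiveSink`, `DedupSink`) are hypothetical.

```python
def deliver_at_least_once(messages, lost_acks):
    """Yield every message once, and replay those whose ack was lost (toy source)."""
    for msg_id, amount in messages:
        yield msg_id, amount
        if msg_id in lost_acks:
            # The source cannot tell whether processing succeeded, so it replays.
            yield msg_id, amount


class NaiveSink:
    """Adds every delivery to the total, so replays are double counted."""

    def __init__(self):
        self.total = 0

    def process(self, msg_id, amount):
        self.total += amount


class DedupSink:
    """User-code deduplication: remembers seen IDs and ignores replays."""

    def __init__(self):
        self.total = 0
        self.seen = set()

    def process(self, msg_id, amount):
        if msg_id in self.seen:
            return  # replayed message, already applied
        self.seen.add(msg_id)
        self.total += amount


messages = [(1, 10), (2, 20), (3, 30)]
naive, dedup = NaiveSink(), DedupSink()
for msg_id, amount in deliver_at_least_once(messages, lost_acks={2}):
    naive.process(msg_id, amount)
    dedup.process(msg_id, amount)

print(naive.total, dedup.total)  # prints "80 60"
```

In a real deployment the set of seen IDs would itself have to be stored durably and updated together with the result, which is part of the extra user code and latency the table refers to.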