
What does "streaming" mean in Apache Spark and Apache Flink?

On the Apache Spark Streaming website, I saw this sentence:

Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.

And on the Apache Flink website, there is this sentence:

Apache Flink is an open source platform for scalable batch and stream data processing.

What do "streaming application", "batch data processing", and "stream data processing" mean? Can you give some concrete examples? Are they designed for sensor data?

xirururu asked Jun 30 '15 10:06



People also ask

What is Apache Flink?

Apache Flink is an open-source platform for distributed stream and batch data processing. Flink’s core is a streaming data flow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

What is Apache Spark Streaming?

Spark Streaming (or, properly speaking, 'Apache Spark Streaming') is a software system for processing streams. Spark Streaming analyses streams in near real time. No system processes streams with zero delay: there is always some latency because data arrives in portions that analytical engines can consume.

What is Flink and spark?

Flink and Spark are distributed data processing engines, not databases: they do not store data themselves, but read from and write to external storage systems. For streaming workloads they keep working state in memory where possible, so that current data can be analyzed with low latency. This lets programmers write big data programs over streaming data.

What is the difference between streaming and batch processing in spark?

Note: Streaming deals with data in motion, while batch processing deals with data at rest. Apache Spark (or just Spark) is an open-source computing engine. The Spark engine works hand-in-hand with separate code libraries as an integrated system. The system handles humongous amounts of data by using parallel processing across computer clusters.


1 Answer

Streaming data analysis (in contrast to "batch" data analysis) refers to a continuous analysis of a typically infinite stream of data items (often called events).
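To make the contrast concrete, here is a minimal sketch in plain Python (not Spark or Flink code). The function names `batch_total` and `streaming_totals` are made up for illustration. The batch version waits for all data and returns one result; the streaming version emits an updated result for every event it sees.

```python
from typing import Iterable, Iterator


def batch_total(events: Iterable[float]) -> float:
    """Batch style: the data is complete, so one pass yields one final result."""
    return sum(events)


def streaming_totals(events: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated result as each event arrives."""
    total = 0.0
    for value in events:
        total += value
        yield total  # a result is available immediately, before the stream ends


if __name__ == "__main__":
    data = [3.0, 1.0, 4.0]
    print(batch_total(data))             # one answer at the end
    print(list(streaming_totals(data)))  # one answer per event
```

Because `streaming_totals` is a generator, it also works on an input that never ends, which is the situation a streaming engine is built for.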

Characteristics of Streaming Applications

Stream data processing applications are typically characterized by the following points:

  • Streaming applications run continuously, for a very long time, and consume and process events as soon as they appear. In contrast, batch applications gather data in files or databases and process it later.

  • Streaming applications frequently concern themselves with the latency of results. The latency is the delay between the creation of an event and the point when the analysis application has taken that event into account.

  • Because streams are infinite, many computations cannot refer to the entire stream, but only to a "window" over the stream. A window is a view of a sub-sequence of the stream events (such as the last 5 minutes). An example of a real world window statistic is the "average stock price over the past 3 days".

  • In streaming applications, the time of an event often plays a special role. Interpreting events with respect to their order in time is very common. While certain batch applications may do that as well, it is not a core concept there.
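The window idea can be sketched in a few lines of plain Python. This is only an illustration, not how Spark or Flink implement windows: the name `windowed_average` and the `(event_time, value)` tuple layout are assumptions, and the sketch assumes events arrive in event-time order, which real systems cannot rely on.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Event = Tuple[float, float]  # (event_time_in_seconds, value), hypothetical layout


def windowed_average(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, float]]:
    """For each event, yield (event_time, average of values in the last window_seconds).

    Assumes events arrive ordered by event time.
    """
    window: Deque[Event] = deque()
    total = 0.0
    for event_time, value in events:
        window.append((event_time, value))
        total += value
        # Evict events that are no longer inside the window.
        while window and window[0][0] <= event_time - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield event_time, total / len(window)


if __name__ == "__main__":
    prices = [(0, 10.0), (60, 20.0), (400, 30.0)]
    for time, avg in windowed_average(prices, window_seconds=300):
        print(time, avg)
```

Only the events inside the window are kept in memory, so the computation stays bounded even though the stream itself never ends.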

Examples of Streaming Applications

Typical examples of stream data processing applications are:

  • Fraud Detection: The application tries to figure out whether a transaction fits with the behavior that has been observed before. If it does not, the transaction may indicate an attempted misuse. This is typically a very latency-critical application.

  • Anomaly Detection: The streaming application builds a statistical model of the events it observes. Outliers indicate anomalies and may trigger alerts. Sensor data may be one source of events that one wants to analyze for anomalies.

  • Online Recommenders: If not a lot of past behavior information is available on a user that visits a web shop, it is interesting to learn from her behavior as she navigates the pages and explores articles, and to start generating some initial recommendations directly.

  • Up-to-date Data Warehousing: There are interesting articles on how to model a data warehousing infrastructure as a streaming application, where the event stream is a sequence of changes to the database, and the streaming application computes various warehouses as specialized "aggregate views" of the event stream.

  • There are many more ...
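As a rough illustration of the anomaly detection case, the following plain-Python sketch keeps a running mean and variance (Welford's online algorithm) and flags values that lie far from the mean. It is one simple way to do this, not the method used by any particular system. The name `detect_outliers`, the `threshold` of 3 standard deviations, and the `warmup` of 5 events are all assumptions chosen for the example.

```python
import math
from typing import Iterable, Iterator


def detect_outliers(
    values: Iterable[float], threshold: float = 3.0, warmup: int = 5
) -> Iterator[float]:
    """Yield values more than `threshold` standard deviations from the running mean.

    The statistics are updated one event at a time, so no history is stored.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        # Only judge an event once enough events have been seen.
        if count >= warmup:
            std = math.sqrt(m2 / (count - 1))
            if std > 0 and abs(x - mean) > threshold * std:
                yield x
        # Welford update: fold the new event into the model.
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)


if __name__ == "__main__":
    readings = [20.0, 21.0, 19.0, 20.0, 21.0, 19.0, 20.0, 55.0, 20.0]
    print(list(detect_outliers(readings)))
```

Note that this sketch also folds flagged values into the model; a real application might exclude them or use a windowed model so that old behavior is forgotten over time.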

Stephan Ewen answered Sep 21 '22 15:09
