
Storage in Apache Flink

Tags:

apache-flink

After processing millions of events, where is the best place to store the data so that keeping millions of events is actually worthwhile? I saw a pull request, closed by a commit, that mentions Parquet formats, but is HDFS the default? My concern is: after saving the data (and where?), is it easy (and fast) to retrieve it again?

asked Aug 11 '15 21:08 by Jonathan Santilli


1 Answer

Apache Flink is not coupled with specific storage engines or formats. The best place to store the results computed by Flink depends on your use case.

  • Are you running a batch or streaming job?
  • What do you want to do with the result?
  • Do you need batch (full scan), point, or continuous streaming access to the data?
  • What format does the data have? Flat and structured (relational), nested, blob, ...

Depending on the answers to these questions, you can choose from various storage backends, such as:

  • Apache HDFS for batch access (with different storage formats such as Parquet, ORC, or custom binary formats)
  • Apache Kafka if you want to access the data as a stream
  • a key-value store such as Apache HBase or Apache Cassandra for point access to the data
  • a database such as MongoDB, MySQL, ...
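For example, if you go with HDFS for batch access, a minimal sketch using Flink's DataSet API could look like the following. The class name, the HDFS path, and the sample data are placeholders invented for illustration, not taken from the question:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.core.fs.FileSystem.WriteMode;

    public class HdfsBatchSinkSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical aggregated result of an earlier Flink job: (eventType, count)
            DataSet<Tuple2<String, Long>> counts = env.fromElements(
                    Tuple2.of("click", 120_000L),
                    Tuple2.of("view", 450_000L));

            // Write the result to HDFS as CSV; path and write mode are placeholders
            counts.writeAsCsv("hdfs:///data/event-counts", WriteMode.OVERWRITE);

            env.execute("Write aggregated events to HDFS");
        }
    }

Files written this way can be scanned by later batch jobs; if you need fast point lookups instead, one of the key-value stores above is usually the better fit.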

Flink provides OutputFormats for most of these systems (some through a wrapper for Hadoop OutputFormats). The "best" system depends on your use case.
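As a rough illustration of the Hadoop wrapper route, the sketch below writes a (word, count) DataSet through Hadoop's TextOutputFormat using the HadoopOutputFormat wrapper from the flink-hadoop-compatibility module. Again, the class name, output path, and sample data are assumptions made for the example:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class HadoopSinkSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical result of an earlier computation: (word, count) pairs
            DataSet<Tuple2<Text, IntWritable>> result = env.fromElements(
                    Tuple2.of(new Text("flink"), new IntWritable(3)),
                    Tuple2.of(new Text("parquet"), new IntWritable(1)));

            // Wrap Hadoop's TextOutputFormat so Flink can use it as a sink
            Job job = Job.getInstance();
            HadoopOutputFormat<Text, IntWritable> hadoopOF =
                    new HadoopOutputFormat<>(new TextOutputFormat<Text, IntWritable>(), job);
            TextOutputFormat.setOutputPath(job, new Path("hdfs:///tmp/flink-output"));

            result.output(hadoopOF);
            env.execute("Write results via Hadoop OutputFormat");
        }
    }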

answered Oct 04 '22 03:10 by Fabian Hueske