After processing those millions of events, where is the best place to store the information so that saving millions of events is actually worth it? I saw a pull request closed by this commit that mentions Parquet formats, but is the default HDFS? My concern is whether the data is easy (and fast!) to retrieve after saving it (and where?).
Apache Flink is not coupled to a specific storage engine or format. The best place to store the results computed by Flink depends on your use case: Do you need batch access to the data, do you want to consume it as a stream, or do you need fast point lookups of individual records?
Depending on the answers to these questions, you can choose from various storage backends, for example:

- Apache HDFS for batch access (with different storage formats such as Parquet, ORC, or custom binary formats)
- Apache Kafka if you want to access the data as a stream
- a key-value store such as Apache HBase or Apache Cassandra for point access to the data
- a database such as MongoDB, MySQL, ...
Flink provides OutputFormats for most of these systems (some through a wrapper for Hadoop's OutputFormats). Again, which system is "best" depends on your use case.
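For illustration, here is a minimal sketch (not part of the original answer) that writes a Flink DataSet to HDFS in two ways: once as plain text files via `writeAsText`, and once through Flink's `HadoopOutputFormat` wrapper around Hadoop's `TextOutputFormat`. The output paths and the sample data are placeholders; the same wrapper approach applies to other Hadoop OutputFormats (e.g. one that writes Parquet).

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WriteResultsToHdfsSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder for the result of your actual computation.
        DataSet<Tuple2<Text, LongWritable>> result = env
                .fromElements("event-a", "event-b", "event-a")
                .map(new MapFunction<String, Tuple2<Text, LongWritable>>() {
                    @Override
                    public Tuple2<Text, LongWritable> map(String value) {
                        return new Tuple2<>(new Text(value), new LongWritable(1L));
                    }
                });

        // Option 1: write plain text files to HDFS (or any Hadoop-supported file system).
        result.writeAsText("hdfs:///tmp/flink-results-text");

        // Option 2: write through Flink's wrapper for Hadoop OutputFormats.
        Job job = Job.getInstance();
        HadoopOutputFormat<Text, LongWritable> hadoopFormat =
                new HadoopOutputFormat<>(new TextOutputFormat<Text, LongWritable>(), job);
        TextOutputFormat.setOutputPath(job, new Path("hdfs:///tmp/flink-results-hadoop"));
        result.output(hadoopFormat);

        env.execute("write results to HDFS");
    }
}
```

This uses the batch DataSet API; for a streaming job you would instead attach a sink to a DataStream (for example a Kafka producer or a file sink), but the overall idea of picking the sink to match how the data will be read back stays the same.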