Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Flink: Why to choose the MemoryStateBackend over the FsStateBackend?

Flink has a MemoryStateBackend and a FsStateBackend (and a RocksDBStateBackend). Both seem to extend the HeapKeyedStateBackend, i.e. the mechanism for storing the current working state is entirely the same.

This SO answer says that the main difference lies in the MemoryStateBackend keeping a copy of the checkpoints in the JobManagers memory. (I wasn't able to glean any evidence for that from the source code.) The MemoryStateBackend also limits the maximum state size per subtask.

Now I wonder: Why would you ever want to use the MemoryStateBackend?

like image 914
Caesar Avatar asked Jan 23 '19 01:01

Caesar


People also ask

Does Flink use RocksDB?

The Flink TaskManager is allocated with 1.5 CPU cores and 4 GB memory. The job uses the RocksDB state backend, which is configured to use Flink's managed memory.

What are the state storage supported by Flink?

In addition to RocksDBStateBackend, Flink has two other built-in state backends: MemoryStateBackend and FsStateBackend. They both are heap-based, as in-flight state is stored in the JVM heap.

Does Flink has its own storage?

"Flink does not provide its own data storage system but provides data-source and sink connectors to systems, such as Amazon Kinesis, Apache Kafka, Alluxio, HDFS, Apache Cassandra, and Elasticsearch." Soon, the first half of this sentence may no longer be applicable.

What database does Flink use?

Flink provides a rich set of connectors to various storage systems such as Kafka, Kinesis, Elasticsearch, and JDBC database systems.


1 Answers

As you said, both MemoryStateBackend and FSStateBackend are based on HeapKeyedStateBackend. This means, that both state backends maintain the state of an operator as regular objects on the JVM heap of the TaskManager, i.e., state is always accessed in memory.

The backends differ in how they persist the state for checkpoints. A checkpoint is a copy of the state of all operators of an application that is stored somewhere. In case of a failure, the application is restarted and the state of the operators is initialized from the checkpoint.

The FSStateBackend stores the checkpoint in a file system, typically HDFS, S3, or a NFS that is mounted on all worker nodes. The MemoryStateBackend stores the state in the JVM of the JobManager. This has the following pros and cons:

Pros:

  • No need to setup a (distributed) file system.
  • No need to configure a storage location.

Cons:

  • State is lost if the JobManager process dies.
  • Size of state is bound by the size of the JobManager memory.

Since checkpoints are lost if the JM goes down, the MemoryStateBackend is unsuitable for most production use cases. It can be useful for developing and testing stateful applications, because it requires not configuration or setup.

like image 66
Fabian Hueske Avatar answered Sep 30 '22 09:09

Fabian Hueske