 

How does the CAP Theorem apply to HDFS?

I just started reading about Hadoop and came across the CAP theorem. Could you please shed some light on which two of the three CAP properties apply to an HDFS system?

Pallav Doshi asked Nov 11 '19 05:11




1 Answer

Argument for Consistency

The Hadoop FileSystem specification states clearly: "The consistency model of a Hadoop FileSystem is one-copy-update-semantics; that of a traditional local POSIX filesystem."

(One-copy-update semantics means that every process reading or updating a given file sees its contents as if only a single copy of the file existed.)

The specification goes on to say:

  • "Create. Once the close() operation on an output stream writing a newly created file has completed, in-cluster operations querying the file metadata and contents MUST immediately see the file and its data."
  • "Update. Once the close() operation on an output stream writing a newly created file has completed, in-cluster operations querying the file metadata and contents MUST immediately see the new data."
  • "Delete. once a delete() operation on a path other than “/” has completed successfully, it MUST NOT be visible or accessible. Specifically, listStatus(), open() ,rename() and append() operations MUST fail."

Together, these guarantees show that HDFS provides the "Consistency" property of CAP.

Source: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/filesystem/introduction.html
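As a rough illustration of the create, update and delete guarantees quoted above, here is a toy in-memory model in Python. It is not the Hadoop API: ToyFileSystem, its methods and the path used are hypothetical names invented for this sketch, and the choice to hide data until close() is an assumption of the model (the quoted text only says what must hold after close() completes).

```python
class ToyFileSystem:
    """Hypothetical in-memory model of one-copy-update semantics (not HDFS code)."""

    def __init__(self):
        # A single shared map: every reader looks at the same "one copy".
        self._files = {}

    def create(self, path):
        # Returns a stream; nothing is published until the stream is closed.
        return _ToyOutputStream(self, path)

    def open(self, path):
        if path not in self._files:
            raise FileNotFoundError(path)
        return self._files[path]

    def delete(self, path):
        if path == "/":
            raise ValueError("this sketch does not model deleting the root")
        return self._files.pop(path, None) is not None


class _ToyOutputStream:
    def __init__(self, fs, path):
        self._fs = fs
        self._path = path
        self._buffer = bytearray()

    def write(self, data):
        # Buffered only; readers cannot see this yet (an assumption of the model).
        self._buffer.extend(data)

    def close(self):
        # Publication point: once close() returns, every reader sees the data.
        self._fs._files[self._path] = bytes(self._buffer)


fs = ToyFileSystem()
path = "/data/part-0"  # hypothetical path

out = fs.create(path)
out.write(b"hello")
try:
    fs.open(path)
    seen_before_close = True
except FileNotFoundError:
    seen_before_close = False

out.close()
seen_after_close = fs.open(path)   # create: visible as soon as close() is done

out = fs.create(path)              # update: overwrite the same path
out.write(b"hello again")
out.close()
seen_after_update = fs.open(path)

deleted = fs.delete(path)          # delete: the path must no longer be openable
try:
    fs.open(path)
    open_failed_after_delete = False
except FileNotFoundError:
    open_failed_after_delete = True

print(seen_before_close, seen_after_close, seen_after_update, open_failed_after_delete)
```

The point of the sketch is only that there is one authoritative copy and a well-defined moment (close() or delete() returning) after which all readers agree on its state.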

Argument for Partition Tolerance

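HDFS provides High Availability for NameNodes through an active/standby pair, and it protects the data held on DataNodes by replicating blocks across them, so the cluster can keep operating when individual nodes fail or become unreachable.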

Source: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html

Argument for Lack of Availability

This is stated explicitly in the documentation (under the section "Operations and failures"):

"The time to complete an operation is undefined and may depend on the implementation and on the state of the system."

This suggests that HDFS does not guarantee "Availability" in the CAP sense, since the specification makes no promise about when, or how quickly, an operation will complete.

Source: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/filesystem/introduction.html

Given these arguments, I believe HDFS provides "Consistency" and "Partition Tolerance" but not "Availability" in the context of the CAP theorem, i.e. it behaves as a CP system.
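To make the CP trade-off concrete, here is a minimal Python sketch of a replicated register that refuses requests when it cannot reach a majority of its replicas. This is a generic illustration of choosing consistency over availability during a partition, not how HDFS is implemented; Replica, CPStore and Unavailable are hypothetical names made up for the example.

```python
class Unavailable(Exception):
    """Raised when a majority of replicas cannot be reached."""


class Replica:
    def __init__(self, name):
        self.name = name
        self.reachable = True   # flipped to False to simulate a partition
        self.value = None
        self.version = 0


class CPStore:
    """Toy majority-quorum register: stays consistent, may become unavailable."""

    def __init__(self, replicas):
        self.replicas = replicas

    def _majority(self):
        live = [r for r in self.replicas if r.reachable]
        if len(live) <= len(self.replicas) // 2:
            # Give up availability rather than risk divergent copies.
            raise Unavailable("no majority reachable; refusing the request")
        return live

    def write(self, value):
        live = self._majority()
        version = max(r.version for r in live) + 1
        for r in live:
            r.value, r.version = value, version

    def read(self):
        live = self._majority()
        # Any two majorities overlap, so the newest version is always among them.
        return max(live, key=lambda r: r.version).value


a, b, c = Replica("a"), Replica("b"), Replica("c")
store = CPStore([a, b, c])
store.write("v1")

# Partition: only replica "a" is reachable, so requests are rejected.
b.reachable = c.reachable = False
try:
    store.write("v2")
    rejected_during_partition = False
except Unavailable:
    rejected_during_partition = True

# Partial heal: "a" and "b" form a majority again and still agree on "v1".
b.reachable = True
value_after_heal = store.read()
print(rejected_during_partition, value_after_heal)
```

During the partition the store answers with an error instead of stale or conflicting data, which is the behaviour the "CP" label describes.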

Shariq Ehsan answered Oct 02 '22 15:10
