Why do we need ZooKeeper in the Hadoop stack?

People also ask

What is the purpose of ZooKeeper in the Hadoop stack?

ZooKeeper is a centralized service that stores configuration, naming, and group-service information. Using this information, ZooKeeper lets the components of a Hadoop cluster operate as a single unit and is responsible for synchronizing Hadoop tasks.

Why do we need ZooKeeper?

ZooKeeper is an open source Apache project that provides a centralized service for providing configuration information, naming, synchronization and group services over large clusters in distributed systems. The goal is to make these systems easier to manage with improved, more reliable propagation of changes.

When should I use ZooKeeper?

Apache ZooKeeper is used for maintaining centralized configuration information, naming, providing distributed synchronization, and providing group services in a simple interface so that we don't have to write it from scratch. Apache Kafka also uses ZooKeeper to manage configuration.

What is ZooKeeper used for?

ZooKeeper is a centralized open-source server for maintaining and managing configuration information, naming conventions, and synchronization in a distributed cluster environment.


Hadoop 1.x does not use ZooKeeper. HBase does use ZooKeeper, even in Hadoop 1.x installations.

Hadoop itself adopted ZooKeeper starting with version 2.0.

The purpose of ZooKeeper is cluster management. This fits the general *nix philosophy of using smaller, specialized components: Hadoop components that want clustering capabilities rely on ZooKeeper rather than developing their own.

ZooKeeper is a distributed store that provides the following guarantees (copied from the ZooKeeper overview page):

  • Sequential Consistency - Updates from a client will be applied in the order that they were sent.
  • Atomicity - Updates either succeed or fail. No partial results.
  • Single System Image - A client will see the same view of the service regardless of the server that it connects to.
  • Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
  • Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.

You can use these guarantees to implement the different "recipes" required for cluster management, such as locks and leader election.
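As a rough illustration of the leader election recipe, here is a minimal in-memory sketch. It does not talk to a real ZooKeeper ensemble: `ToyEnsemble` and its methods are hypothetical names that only mimic the behavior of ephemeral sequential znodes (ordered names that disappear when the owning session ends). The candidate that owns the lowest-numbered node is treated as the leader.

```python
import itertools


class ToyEnsemble:
    """In-memory stand-in for a ZooKeeper ensemble (hypothetical, not the real API)."""

    def __init__(self):
        self._counter = itertools.count()  # monotonically increasing sequence numbers
        self._nodes = {}                   # node name -> owning session id

    def create_ephemeral_sequential(self, prefix, session_id):
        # Mimics an ephemeral sequential znode: a unique, ordered name tied to a session.
        name = f"{prefix}{next(self._counter):010d}"
        self._nodes[name] = session_id
        return name

    def close_session(self, session_id):
        # Ephemeral nodes vanish when their owning session ends.
        self._nodes = {n: s for n, s in self._nodes.items() if s != session_id}

    def children(self):
        # Zero-padded sequence numbers make lexicographic order match creation order.
        return sorted(self._nodes)

    def owner(self, name):
        return self._nodes[name]


def current_leader(ensemble):
    """Return the session owning the lowest-numbered node, or None if there are no candidates."""
    kids = ensemble.children()
    return ensemble.owner(kids[0]) if kids else None


zk = ToyEnsemble()
for session in ("node-a", "node-b", "node-c"):
    zk.create_ephemeral_sequential("/election/candidate-", session)

first = current_leader(zk)    # node-a registered first, so it leads
zk.close_session("node-a")    # simulate the leader crashing
second = current_leader(zk)   # leadership passes to the next-lowest node
print(first, second)
```

In a real deployment each candidate would also watch the node just ahead of its own so it learns when to take over; the sketch leaves that out to keep the ordering idea visible.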

If you're going to use ZooKeeper yourself, I recommend you take a look at Curator (originally from Netflix, now Apache Curator), which makes it easier to use; for example, it implements a few recipes out of the box.


ZooKeeper solves the problem of reliable distributed coordination, and Hadoop is a distributed system, right?

There's an excellent paper on the Paxos algorithm that you can read on this subject.


From the ZooKeeper documentation page:

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.

Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

From the Hadoop documentation page:

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Regarding your query:

Why do we need ZooKeeper in Hadoop Stack?

The binding factor is distributed processing and high availability.

For example, consider the Hadoop NameNode failover process.

Hadoop high availability is designed around an Active NameNode and a Standby NameNode for failover. At no point in time should there be two masters (Active NameNodes) at once.
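To make the "only one Active at a time" rule concrete, here is a toy sketch of a lock held by whichever NameNode is currently active. It is only an in-memory model under stated assumptions: `ToyLockService`, `try_acquire`, and `session_expired` are hypothetical names, and the lock stands in for an ephemeral lock node that is released when the holder's session dies.

```python
class ToyLockService:
    """Hypothetical in-memory model of a single ephemeral lock node."""

    def __init__(self):
        self.holder = None  # session id of the current Active, if any

    def try_acquire(self, session_id):
        # Create-if-absent: only succeeds when nobody holds the lock.
        if self.holder is None:
            self.holder = session_id
        return self.holder == session_id

    def session_expired(self, session_id):
        # An ephemeral lock is released when its owner's session ends.
        if self.holder == session_id:
            self.holder = None


lock = ToyLockService()
nn1_active = lock.try_acquire("nn1")  # nn1 becomes Active
nn2_active = lock.try_acquire("nn2")  # nn2 stays Standby
lock.session_expired("nn1")           # nn1 crashes or loses its session
nn2_after = lock.try_acquire("nn2")   # nn2 is promoted to Active
nn1_after = lock.try_acquire("nn1")   # a recovered nn1 can only be Standby
print(nn1_active, nn2_active, nn2_after, nn1_after)
```

The point of the design is that the decision about who is Active lives in one coordinated place rather than in each NameNode's own opinion of the cluster.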

From the Apache documentation on HDFSHighAvailabilityWithQJM:

It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time.

During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.
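One common way to enforce a single writer is to tag each writer with an increasing epoch number and have the journal reject anything from an older epoch. The sketch below is a simplified model of that idea, not the actual QJM implementation: `ToyJournalNode`, `new_epoch`, `journal`, and `quorum_write` are hypothetical names chosen for illustration.

```python
class ToyJournalNode:
    """Hypothetical journal that accepts edits only from the latest writer epoch."""

    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch):
        # A NameNode becoming Active must present a strictly higher epoch.
        if epoch <= self.promised_epoch:
            return False
        self.promised_epoch = epoch
        return True

    def journal(self, epoch, edit):
        # Writes from any epoch other than the promised one are fenced off.
        if epoch != self.promised_epoch:
            return False
        self.edits.append(edit)
        return True


def quorum_write(journals, epoch, edit):
    """A write succeeds only if a majority of journals accept it."""
    acks = sum(j.journal(epoch, edit) for j in journals)
    return acks > len(journals) // 2


journals = [ToyJournalNode() for _ in range(3)]

# nn1 becomes Active with epoch 1 and writes an edit.
for j in journals:
    j.new_epoch(1)
ok_before = quorum_write(journals, 1, "mkdir /a")

# Failover: nn2 takes over with epoch 2.
for j in journals:
    j.new_epoch(2)

stale_write = quorum_write(journals, 1, "mkdir /b")  # old Active is rejected
fresh_write = quorum_write(journals, 2, "mkdir /c")  # new Active proceeds
print(ok_before, stale_write, fresh_write)
```

Even if the old Active NameNode still believes it is in charge, its writes no longer reach a majority, so the namespace cannot diverge.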

ZooKeeper is used to avoid the split-brain scenario. You can find the role of ZooKeeper described in the question below:

How does Hadoop Namenode failover process works?