Is something written to HDFS or HBase visible to all other nodes in a Hadoop cluster immediately?

While a Hadoop job is running, if I write something to HDFS or HBase, will that data be visible to all nodes in the cluster

1.) immediately?

2.) If not immediately, then after how much time?

3.) Or can the time really not be determined?

seahorse asked Feb 12 '12 12:02


2 Answers

HDFS is strongly consistent, so once a write has completed successfully (the file has been closed, or the output stream has been flushed with hflush), the new data should be visible across all nodes immediately. Clearly the actual writing takes some time - see replication pipelining for some details on this.

This is in contrast to eventually consistent systems, where it may take an indefinite time (though often only a few milliseconds) before all nodes see a consistent view of the data.

Systems such as Cassandra have tunable consistency - each read and write can be performed at a different level of consistency to suit the operation being performed.
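
As a rough illustration of tunable consistency, the usual quorum rule says that a read is guaranteed to overlap the latest acknowledged write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N). The sketch below only checks that inequality in plain Python; the function name and the sample values are made up for illustration and are not part of any Cassandra API.

```python
def read_sees_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """Return True when every read set must overlap every write set (R + W > N)."""
    for name, count in (("write_replicas", write_replicas),
                        ("read_replicas", read_replicas)):
        if not 1 <= count <= replication_factor:
            raise ValueError(f"{name} must be between 1 and the replication factor")
    return read_replicas + write_replicas > replication_factor


# Three replicas, write to 2 and read from 2: the sets must share a replica.
print(read_sees_latest_write(3, 2, 2))  # True
# Three replicas, write to 1 and read from 1: the read may hit a stale replica.
print(read_sees_latest_write(3, 1, 1))  # False
```

When the function returns False, the system is only eventually consistent for that pair of operations: the read can land entirely on replicas the write has not reached yet.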

DNA answered Sep 28 '22 16:09


To the best of my understanding, the data is visible immediately after the write operation finishes.
Let's look at some aspects of the process:
When a client writes to HDFS, the data is written to all replicas, and once the write operation has finished it should be fully available.
There is also only one place that holds the metadata, the NameNode, and it has no notion of isolation that would allow data to be hidden until some larger piece of work is done.
HBase is a different case: it immediately writes only its log (the write-ahead log) to HDFS, and new data reaches its HFiles only later, when the MemStore is flushed; compaction then merges those files. At the same time, once HBase itself has written something to HDFS, that data is visible immediately.
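
The HBase write path can be pictured with a toy model like the one below. It is a simplified sketch in plain Python, not the HBase API: the class and method names are hypothetical, and it assumes that reads consult the in-memory store before the flushed files, which is why a value is readable through the store before any data file exists.

```python
class ToyRegionStore:
    """Toy model of an HBase-style write path (illustrative only, not the HBase API)."""

    def __init__(self):
        self.wal = []        # stands in for the log that is written to HDFS right away
        self.memstore = {}   # recent writes held in memory
        self.hfiles = []     # immutable flushed files, oldest first

    def put(self, key, value):
        # A write is logged first, then applied to the in-memory store.
        self.wal.append((key, value))
        self.memstore[key] = value

    def flush(self):
        # Flushing turns the in-memory store into a new immutable file.
        if self.memstore:
            self.hfiles.append(dict(self.memstore))
            self.memstore = {}

    def get(self, key):
        # Reads check memory first, then files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            if key in hfile:
                return hfile[key]
        return None


store = ToyRegionStore()
store.put("row1", "v1")
print(store.get("row1"), len(store.hfiles))  # v1 0
store.flush()
print(store.get("row1"), len(store.hfiles))  # v1 1
```

In this model the value is readable through the store both before and after the flush; only the place it is served from changes.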

David Gruzman answered Sep 28 '22 15:09