While a Hadoop job is running, if I write something to HDFS or HBase, will that data be visible to all nodes in the cluster
1.) Immediately?
2.) If not immediately, then after how much time?
3.) Or can the time really not be determined?
HDFS is strongly consistent, so once a write has completed successfully (the file has been closed, or the stream has been flushed with hflush()), the new data should be visible across all nodes immediately. Clearly the actual writing takes some time - see replication pipelining for some details on this.
This is in contrast to eventually consistent systems, where it may take an indefinite time (though often only a few milliseconds) before all nodes see a consistent view of the data.
Systems such as Cassandra have tunable consistency - each read and write can be performed at a different level of consistency to suit the operation being performed.
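As a rough illustration of what "tunable" means here, the usual rule of thumb for quorum-based stores is that a read is guaranteed to see the latest acknowledged write when the number of replicas answering the read plus the number acknowledging the write exceeds the replication factor (R + W > N). The sketch below is only that arithmetic in plain Java; the class and method names are made up for this example and are not part of any Cassandra driver.

```java
// Toy check of the R + W > N overlap rule; names are hypothetical, not a Cassandra API.
final class QuorumCheck {

    /** Smallest majority of the given number of replicas. */
    static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    /**
     * Returns true when every read set must overlap every write set,
     * i.e. a read is guaranteed to reach at least one replica holding the latest write.
     */
    static boolean readsSeeLatestWrite(int replicas, int writeAcks, int readAcks) {
        if (replicas < 1 || writeAcks < 1 || readAcks < 1
                || writeAcks > replicas || readAcks > replicas) {
            throw new IllegalArgumentException("acks must be between 1 and the replica count");
        }
        return readAcks + writeAcks > replicas;
    }

    public static void main(String[] args) {
        int n = 3; // assumed replication factor for the demo
        // Write at QUORUM, read at QUORUM: 2 + 2 > 3
        System.out.println(readsSeeLatestWrite(n, quorum(n), quorum(n)));
        // Write at ONE, read at ONE: 1 + 1 <= 3, so only eventually consistent
        System.out.println(readsSeeLatestWrite(n, 1, 1));
        // Write at ALL, read at ONE: 3 + 1 > 3
        System.out.println(readsSeeLatestWrite(n, n, 1));
    }
}
```

With a replication factor of 3, writing and reading at QUORUM behaves like a strongly consistent store, while ONE/ONE trades that guarantee for lower latency.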
To the best of my understanding, the data is visible immediately after the write operation has finished.
Let's look at some aspects of the process:
When a client writes to HDFS, the data is written to all replicas, and once the write operation has finished it should be fully available.
There is also only one place that holds the metadata - the NameNode - and it has no notion of isolation that would allow data to be hidden until some larger piece of work is done.
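HBase is a different case - it writes only its log (the WAL) to HDFS immediately, while new data is held in memory (the MemStore) and reaches HFiles only when the MemStore is flushed; compaction later merges those HFiles. Reads through HBase consult the MemStore as well as the HFiles, so the data is visible to HBase clients right away. At the same time, whatever HBase itself writes into HDFS is visible immediately once it is written.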
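To make the HBase write path a bit more concrete, here is a deliberately simplified in-memory model of it. It does not use the HBase API at all; `ToyRegion` and its methods are invented for illustration, and the real system is far more involved. The point it tries to show is that a put is readable as soon as it returns, before any HFile contains it, because reads consult the in-memory store first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: not the HBase API. All names here are hypothetical.
final class ToyRegion {
    // Stands in for the write-ahead log that is persisted to HDFS on every put.
    private final List<String> wal = new ArrayList<>();
    // Stands in for the in-memory MemStore.
    private final NavigableMap<String, String> memStore = new TreeMap<>();
    // Stands in for immutable HFiles on HDFS, newest first.
    private final Deque<NavigableMap<String, String>> hFiles = new ArrayDeque<>();

    /** Append to the log, then update the in-memory store. */
    void put(String row, String value) {
        wal.add(row + "=" + value);
        memStore.put(row, value);
    }

    /** Reads check the MemStore first, then HFiles from newest to oldest. */
    String get(String row) {
        String value = memStore.get(row);
        if (value != null) {
            return value;
        }
        for (NavigableMap<String, String> file : hFiles) {
            value = file.get(row);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    /** A flush turns the current MemStore contents into a new HFile. */
    void flush() {
        if (memStore.isEmpty()) {
            return;
        }
        hFiles.addFirst(new TreeMap<>(memStore));
        memStore.clear();
    }

    /** A compaction merges all HFiles into one; newer values win. */
    void compact() {
        NavigableMap<String, String> merged = new TreeMap<>();
        Iterator<NavigableMap<String, String>> oldestFirst = hFiles.descendingIterator();
        while (oldestFirst.hasNext()) {
            merged.putAll(oldestFirst.next());
        }
        hFiles.clear();
        if (!merged.isEmpty()) {
            hFiles.addFirst(merged);
        }
    }

    int walSize() {
        return wal.size();
    }

    int hFileCount() {
        return hFiles.size();
    }

    public static void main(String[] args) {
        ToyRegion region = new ToyRegion();
        region.put("row1", "v1");
        // Readable immediately, although no HFile exists yet.
        System.out.println(region.get("row1") + " with " + region.hFileCount() + " HFiles");
        region.flush();
        System.out.println(region.get("row1") + " with " + region.hFileCount() + " HFiles");
    }
}
```

In this model neither `flush()` nor `compact()` changes what `get()` returns; they only change where the value is stored, which mirrors why visibility to HBase clients does not depend on when HFiles are written.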