 

What is meant by "HDFS lacks random read and write access"?

Tags: hadoop, hbase, hdfs

Any file system should provide an API to access its files and directories.

So, what is meant by "HDFS lacks random read and write access"?

Is that why we should use HBase?

asked Jul 11 '14 by lovespring

People also ask

What is write once read access in HDFS?

HDFS is built around the idea of "Write Once, Read Many": it assumes that a file, once written to HDFS, will not be modified, though it can be read any number of times. HDFS is designed more for batch processing than for interactive use.

How file are read and write in HDFS?

HDFS follows a Write Once Read Many model. So we can't edit files that are already stored in HDFS, but we can append to them by reopening the file. This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across all the DataNodes in the cluster.
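For illustration, here is a minimal sketch of what "reopening the file for append" can look like with the Hadoop Java client. The cluster address, file path, and class name are made-up placeholders, and append support must be enabled on the cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsAppendExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder cluster address

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/data/events.log"); // hypothetical file

                // In-place edits are not supported, but appending new data is.
                try (FSDataOutputStream out = fs.append(file)) {
                    out.writeBytes("new record\n");
                }
                // There is no seek-and-overwrite API: changing a byte in the
                // middle of the file would mean rewriting the whole file.
            }
        }
    }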

Is HDFS write once read many?

HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.

Does HBase provides random access to HDFS data?

HDFS is limited in that it does not support random-access reads of the data. Apache HBase (HBase, 2015) is an open-source, distributed, versioned, NoSQL (non-relational) database that natively allows random access and indexing of data.


2 Answers

The default HDFS block size is 128 MB. So you cannot read one line here, one line there. You always read and write 128 MB blocks. This is fine when you want to process the whole file. But it makes HDFS unsuitable for some applications, like where you want to use an index to look up small records.

HBase on the other hand is great for this. If you want to read a small record, you will only read that small record.
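As a rough sketch of what that record-based access looks like from the client side, a single-row read with the HBase Java API might be the following (the table name, column family, and row key are invented for the example):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomRead {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Fetch exactly one row by key -- no scanning of whole 128 MB blocks.
                Get get = new Get(Bytes.toBytes("user#42"));
                Result result = table.get(get);
                byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
                System.out.println(email == null ? "not found" : Bytes.toString(email));
            }
        }
    }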

HBase uses HDFS as its backing store. So how does it provide efficient record-based access?

HBase loads the tables from HDFS into memory or onto local disk, so most reads do not go to HDFS. Mutations are stored first in an append-only journal. When the journal gets large, it is built into an "addendum" table. When there are too many addendum tables, they all get compacted into a brand-new primary table. For reads, the journal is consulted first, then the addendum tables, and finally the primary table. This scheme means a full HDFS block is only written once there is a full HDFS block's worth of changes.

A more thorough description of this approach is in the Bigtable whitepaper.
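To make the journal/addendum idea concrete, here is a toy in-memory sketch of that write path. All class and variable names are invented for illustration; the real HBase code path is of course far more involved:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    // Toy model: mutations go to an append-only journal and an in-memory
    // sorted table; when the in-memory table is large enough, it is written
    // out as one immutable "addendum" -- a whole block's worth at a time.
    class LsmSketch {
        private final List<String> journal = new ArrayList<>();           // stands in for the on-disk journal
        private final TreeMap<String, String> memTable = new TreeMap<>(); // recent mutations, in memory
        private final List<TreeMap<String, String>> addendums = new ArrayList<>(); // flushed, immutable
        private static final int FLUSH_THRESHOLD = 4;                     // tiny, for illustration only

        void put(String key, String value) {
            journal.add(key + "=" + value);    // 1. append to the journal first (durability)
            memTable.put(key, value);          // 2. apply to the in-memory table
            if (memTable.size() >= FLUSH_THRESHOLD) {
                addendums.add(new TreeMap<>(memTable)); // 3. flush a whole "addendum" at once
                memTable.clear();
                journal.clear();
            }
        }

        String get(String key) {
            // Reads check the newest data first, then progressively older addendums.
            if (memTable.containsKey(key)) return memTable.get(key);
            for (int i = addendums.size() - 1; i >= 0; i--) {
                String v = addendums.get(i).get(key);
                if (v != null) return v;
            }
            return null; // would fall through to the primary table in the real system
        }
    }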

answered Oct 19 '22 by Daniel Darabos

In a typical RDBMS, where data is stored in tables, you can read or write any record in any table without having to know what is in the other records. This is called random reading/writing.

But in HDFS, data is (generally) stored as files rather than tables, so reading or writing a single record is not as easy as it is in an RDBMS.
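A small sketch of the contrast, with a hypothetical JDBC URL, credentials, and table: in an RDBMS a single record can be updated in place, whereas the same change to a record sitting inside an HDFS file would require rewriting the file.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class RandomWriteInRdbms {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://localhost/shop", "user", "pass");
                 PreparedStatement ps = conn.prepareStatement(
                         "UPDATE orders SET status = ? WHERE order_id = ?")) {
                // One record is changed in place; no other rows are touched.
                ps.setString(1, "SHIPPED");
                ps.setLong(2, 1042L);
                ps.executeUpdate();
            }
            // The equivalent change to a record stored in an HDFS file would
            // mean rewriting (or at least appending to) the file, because HDFS
            // files cannot be modified in place.
        }
    }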

answered Oct 19 '22 by tacticurv