
Storing data in HBase vs Parquet files


I am new to big data and am trying to understand the various ways of persisting and retrieving data. I understand both Parquet and HBase are column-oriented storage formats, but Parquet is file-oriented storage, not a database like HBase. My questions are:

  1. What is the use case for using Parquet instead of HBase?
  2. Is there a use case where Parquet can be used together with HBase?
  3. For joins, will Parquet be more performant than HBase (say, accessed through a SQL skin like Phoenix)?
asked Sep 09 '18 by sovan

People also ask

Does HBase store data?

There are no data types in HBase; data is stored as byte arrays in the cells of an HBase table. The value in a cell is versioned by the timestamp at which it was stored, so each cell of an HBase table may contain multiple versions of the data.
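This cell-versioning behavior can be sketched with a toy in-memory model (an illustration of the data model only, not the HBase API; the table, row keys, and column names are made up):

```python
import time

# Toy model of HBase's data model: every value is raw bytes, and each
# cell keeps multiple timestamped versions, newest first.
table = {}  # {row_key: {column: [(timestamp, value_bytes), ...]}}

def put(row, column, value, ts=None):
    """Store a new version of a cell; versions are kept, newest first."""
    versions = table.setdefault(row, {}).setdefault(column, [])
    versions.append((ts if ts is not None else time.time_ns(), value))
    versions.sort(key=lambda v: v[0], reverse=True)

def get(row, column):
    """Return the latest version, as HBase does by default."""
    return table[row][column][0][1]

put(b"user1", b"info:name", b"Alice", ts=1)
put(b"user1", b"info:name", b"Alicia", ts=2)  # newer version shadows the old one

print(get(b"user1", b"info:name"))           # → b'Alicia'
print(len(table[b"user1"][b"info:name"]))    # → 2 (both versions retained)
```

Note that both versions remain readable; real HBase prunes old versions according to the column family's `VERSIONS` setting and TTL.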

What is the advantage of using Parquet file?

Parquet is optimized to work with complex data in bulk and offers several efficient data compression and encoding schemes. This approach is especially good for queries that need to read only certain columns from a large table: Parquet readers can fetch just the needed columns, greatly reducing IO.
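The column-pruning idea can be shown with a toy contrast between row-oriented and column-oriented layouts (illustration only; real Parquet adds encodings, compression, and row groups on top, and the sample records here are made up):

```python
# Three sample records, stored two ways.
rows = [
    {"id": 1, "name": "alice", "city": "Pune"},
    {"id": 2, "name": "bob",   "city": "Pune"},
    {"id": 3, "name": "carol", "city": "Delhi"},
]

# Row layout: a query for one column still touches every field of every row.
row_store = [list(r.values()) for r in rows]

# Column layout: each column is stored contiguously, so a query for
# "city" reads only that column and skips the others entirely.
column_store = {key: [r[key] for r in rows] for key in rows[0]}

cities = column_store["city"]   # one column read, two columns never touched
print(cities)                   # → ['Pune', 'Pune', 'Delhi']
```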

Is HBase good for big data?

Apache HBase is an open-source, NoSQL, distributed big data store. It enables random, strictly consistent, real-time access to petabytes of data. HBase is very effective for handling large, sparse datasets.

What are the advantages of storing data in a Parquet file over using a CSV file?

Parquet files take much less disk space than CSVs and are faster to scan; in one published comparison of storage size on Amazon S3 and data scanned per query, the identical dataset was 16 times cheaper to store in Parquet format.
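Much of that size difference comes from per-column encoding. The toy sketch below contrasts a CSV column of repeated strings with a dictionary- plus run-length-encoded version, the style of encoding Parquet applies per column (the sizes and data are illustrative only; real Parquet uses a binary format with page compression on top):

```python
import csv, io
from itertools import groupby

# A low-cardinality column: 2,000 values, only 2 distinct strings.
values = ["Pune"] * 1000 + ["Delhi"] * 1000

# CSV stores every value as full text.
buf = io.StringIO()
csv.writer(buf).writerows([v] for v in values)
csv_size = len(buf.getvalue().encode())

# Dictionary encoding: store each distinct string once, replace values
# with small integer codes. Run-length encoding: collapse repeated codes
# into (code, run_length) pairs.
dictionary = sorted(set(values))
codes = [dictionary.index(v) for v in values]
runs = [(code, len(list(group))) for code, group in groupby(codes)]

# Rough byte estimate: dictionary strings + 8 bytes per (code, count) pair.
encoded_size = sum(len(d.encode()) for d in dictionary) + 8 * len(runs)
print(csv_size, encoded_size)   # the encoded column is far smaller
```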


1 Answer

As you have already said in the question, Parquet is just storage, while HBase is storage (HDFS) plus a query engine (API/shell). So a fair comparison is between Parquet + Impala/Hive/Spark and HBase. Below are the key differences:

1) Disk space - Parquet takes less disk space than HBase; Parquet's column encodings save more space than HBase's block compression.

2) Data ingestion - Ingesting data into Parquet is more efficient than into HBase, largely as a consequence of point 1: with Parquet, less data needs to be written to disk.

3) Record lookup by key - HBase is faster because it is a key-value store, while Parquet files are not indexed by key. (Index support for Parquet was planned for a future release.)

4) Filters and other scan queries - Because Parquet stores statistics about the records in each row group, it can skip large numbers of records while scanning the data. This is why it is faster than HBase for such queries.

5) Updating records - HBase supports record updates, while updates are problematic in Parquet because the affected files need to be rewritten. Careful schema and partition design can make updates cheaper, but it is still not comparable to HBase.
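The row-group statistics behind point 4 can be sketched in a few lines (a toy model, not the actual Parquet format; the row groups and the `age` column are made up for illustration):

```python
# Each "row group" carries min/max statistics per column. A filter such
# as `age > 60` can skip any group whose max cannot possibly match,
# without reading the group's data at all.
row_groups = [
    {"stats": {"min": 18, "max": 35}, "ages": [18, 22, 35, 29]},
    {"stats": {"min": 36, "max": 59}, "ages": [40, 51, 36, 59]},
    {"stats": {"min": 60, "max": 95}, "ages": [61, 95, 72, 60]},
]

def scan_gt(groups, threshold):
    """Return values greater than threshold, skipping groups ruled out by stats."""
    matches, groups_read = [], 0
    for g in groups:
        if g["stats"]["max"] <= threshold:
            continue                      # whole group skipped, no data read
        groups_read += 1
        matches.extend(a for a in g["ages"] if a > threshold)
    return matches, groups_read

matches, groups_read = scan_gt(row_groups, 60)
print(matches, groups_read)   # → [61, 95, 72] 1  (only 1 of 3 groups read)
```

A key-value store like HBase answers `get(row_key)` faster, but for analytical filters over many rows this pruning is what tilts the balance toward Parquet.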

Comparing the features above, HBase seems more suitable for situations where updates are required and queries are mainly key-value lookups. Queries involving key-range scans will also perform better in HBase.

Parquet is suitable for use cases with very few updates and queries involving filters, joins, and aggregations.

answered Sep 28 '22 by Ajay Srivastava