
Is it possible to store HBase data on AWS S3 for online application? How?

I am pretty new to AWS. I am planning to use HBase as the database for my system. I intend to install it on EC2 and keep its actual data files on S3, because of the lower storage cost and the good integration with EMR. I don't want to run HBase on Amazon EMR itself, since it would have to be available 24/7 and I want to avoid the extra cost, but I am going to use EMR for some analytics later. Any idea how to configure HBase for such a setup?

NGR asked Dec 02 '22 14:12



2 Answers

No, you can't. It's not about performance; it's about how HBase implements atomic commits of updates. It relies on renames being O(1) atomic transactions, and the same goes for create(path, overwrite=false). Renames as implemented by the Hadoop S3A client are slow and not transactional: they are one-by-one copies of the directory contents. As for create-no-overwrite, it's a check followed by a write, which is prone to race conditions. Oh, and then there's eventual consistency, especially in listings.
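To make the two failure modes concrete, here is a minimal sketch using an in-memory toy store. `ToyObjectStore`, `rename_prefix`, `create_no_overwrite` and the `fail_after` parameter are hypothetical names invented for illustration; this is not the S3 API or the S3A client code, only a model of "rename is copy plus delete per object" and "create-no-overwrite is check then write".

```python
class ToyObjectStore:
    """In-memory stand-in for a flat key/value object store (hypothetical, not S3)."""

    def __init__(self):
        self.objects = {}

    def rename_prefix(self, src, dst, fail_after=None):
        """Emulate a directory rename: copy each key, then delete the source key.

        fail_after simulates a crash once that many objects have been moved.
        """
        moved = 0
        # sorted() materializes the key list, so mutating the dict below is safe
        for key in sorted(k for k in self.objects if k.startswith(src)):
            if fail_after is not None and moved >= fail_after:
                raise RuntimeError("simulated crash mid-rename")
            self.objects[dst + key[len(src):]] = self.objects[key]  # copy
            del self.objects[key]  # delete the source
            moved += 1
        return moved

    def create_no_overwrite(self, key, data):
        """Check-then-write: the existence check and the put are separate steps."""
        if key in self.objects:
            raise FileExistsError(key)
        # A second writer could put the same key at this point, before we do
        self.objects[key] = data


store = ToyObjectStore()
store.objects = {
    "tmp/region1/a": b"1",
    "tmp/region1/b": b"2",
    "tmp/region1/c": b"3",
}
try:
    store.rename_prefix("tmp/region1/", "data/region1/", fail_after=2)
except RuntimeError:
    pass

# After the simulated crash, the "directory" is split across both paths
remaining = sorted(store.objects)
print(remaining)
```

On a real file system the rename would either have happened completely or not at all; here a reader sees a half-moved directory, which is exactly the state HBase's commit logic assumes can never exist.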

Unless there is something alongside S3 itself providing the locking and leasing needed to manage these operations, you must not attempt to use S3 as a backing store for HBase. Azure has these features; EMR may. It's still a work in progress for Hadoop's S3A, and even there the goal is not HBase atop S3 but faster commits of Hive and Spark work.

I write this as the person currently maintaining Hadoop's S3A client: I speak from knowledge of the codebase and of what it takes for HBase to work.

Update (November 2018): Amazon EMR does support using S3 as a destination.

stevel answered Dec 25 '22 22:12



You have some information here:

It is now possible to use S3 as storage for HBase.

When you run HBase on Amazon EMR version 5.2.0 or later, you can enable Amazon S3 storage mode, which offers the following advantages:

The HBase root directory is stored in Amazon S3, including store files (HFiles) and table metadata. This data is persistent outside of the cluster, available across Amazon EC2 Availability Zones, and you don't need to recover using snapshots or other methods. With store files in Amazon S3, you can size your Amazon EMR cluster for your compute requirements instead of data requirements, with 3x replication in HDFS.
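As a rough sketch of what enabling this looks like, the snippet below builds the cluster configuration that, as far as I recall from the EMR release guide, switches HBase to S3 storage mode. The classification and property names (`hbase-site` / `hbase.rootdir`, `hbase` / `hbase.emr.storageMode`) should be checked against the current EMR documentation; the bucket name and path are hypothetical placeholders.

```python
import json

# Hypothetical bucket and prefix; replace with your own location
HBASE_ROOT_DIR = "s3://my-example-bucket/my-hbase-rootdir"

# Configuration objects passed to EMR at cluster creation (EMR 5.2.0 or later).
# Property names are quoted from memory of the EMR docs; verify before use.
emr_configurations = [
    {
        "Classification": "hbase-site",
        "Properties": {"hbase.rootdir": HBASE_ROOT_DIR},
    },
    {
        "Classification": "hbase",
        "Properties": {"hbase.emr.storageMode": "s3"},
    },
]

# Emit JSON suitable for saving to a file and supplying as the cluster's configurations
config_json = json.dumps(emr_configurations, indent=2)
print(config_json)
```

Note that this applies to HBase managed by EMR, not to a self-installed HBase on plain EC2 instances, which is the setup the question originally asked about.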

This has also been done by FINRA, described here.

Kobe-Wan Kenobi answered Dec 25 '22 22:12
