Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

HDInsight: HBase or Azure Table Storage?

Currently my team is creating a solution that would use HDInsight. We will be getting 5TB of data daily and will need to do some map/reduce jobs on this data. Would there be any performance/cost difference if our data will be stored in Azure Table Storage instead of Azure HBase?

like image 244
Victor F Avatar asked Oct 28 '14 12:10

Victor F


People also ask

What is HBase on HDInsight?

HDInsight HBase is offered as a managed cluster that is integrated into the Azure environment.

What is HDInsight primarily used for?

HDInsight enables you to scale workloads up or down. You can reduce costs by creating clusters on demand and paying only for what you use. You can also build data pipelines to operationalize your jobs. Decoupled compute and storage provide better performance and flexibility.

What is the difference between Azure Data lake and BLOB storage?

Azure Blob Storage is a general purpose, scalable object store that is designed for a wide variety of storage scenarios. Azure Data Lake Storage Gen1 is a hyper-scale repository that is optimized for big data analytics workloads. Based on shared secrets - Account Access Keys and Shared Access Signature Keys.


2 Answers

The main differences will be in both functionality and cost.

Azure Table Storage doesn't have a map reduce engine attached to it in itself, though of course you could use the map reduce approach to write your own.

You can use Azure HDInsight to connect Map Reduce to table storage. There are a couple of connectors around, including one written by me which is hive focused and requires some configuration, and may not suit your partition scheme (http://www.simonellistonball.com/technology/hadoop-hive-inputformat-azure-tables/) and a less performance focused, but more complete version from someone at Microsoft (http://blogs.msdn.com/b/mostlytrue/archive/2014/04/04/analyzing-azure-table-storage-data-with-hdinsight.aspx).

The main advantage of Table Storage is that you aren't constantly taking processing cost.

If you use HBase, you will need to run a full cluster all the time, so there is a cost disadvantage, however, you will get some functionality and performance gains, plus you will have something a bit more portable, should you wish to use other hadoop platforms. You would also have access to a much greater range of analytic functionality with the HBase option.

like image 66
Simon Elliston Ball Avatar answered Oct 31 '22 04:10

Simon Elliston Ball


HDInsight (HBase/Hadoop) uses Azure Blob storage not ATS. For your data-storage you will charged only applicable blob storage cost, based on your subscription.

P.S. Don't forget to delete your cluster once job has completed, to avoid charges. Your data will persist in BLOB storage and can be used by next cluster you build.

like image 32
Abhijeet Avatar answered Oct 31 '22 02:10

Abhijeet