
Hadoop Vs Data Lake

I came across a new term, Data Lake. I googled it and found this:

A data lake is a large-scale storage repository and processing engine. A data lake provides "massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs"

The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization's data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop's cluster nodes of commodity computers.

Hadoop does the same thing: we have HDFS for storage and MapReduce for computation. I am a little confused about Hadoop and data lakes. What is the difference between the two? If they are the same, why did this term arise? Otherwise, how should a data lake be defined?

Kishore asked Mar 14 '16 12:03


4 Answers

A data lake is an abstract "idea". Hadoop is a specific technology/software. You can implement a data lake using Hadoop or using a different tool.

facha answered Oct 07 '22 00:10


A data lake is a method of storing data within a system that facilitates the collation of data in varying schemas and structural forms, usually object blobs or files.

The concept of a data lake is closely tied to Apache Hadoop and its ecosystem of open source projects. All discussions of the data lake quickly lead to a description of how to build a data lake using the power of the Apache Hadoop ecosystem. It’s become popular because it provides a cost-effective and technologically feasible way to meet big data challenges. Organizations are discovering the data lake as an evolution from their existing data architecture.

The following whitepaper serves as an excellent example of building a data lake with Hadoop.

2 revs answered Oct 07 '22 01:10


The easiest way to think of a data lake is as a large container, like a real lake with rivers flowing into it: you never know where the rivers are coming from (or what "type" of river each one is).

A data lake can store massive amounts of different types of data (structured data, unstructured data, log files, real-time data, images, etc.) and blend them together to correlate many different data types. The key point is that we are moving from traditional approaches to modern tools (like Hadoop, Cassandra, NoSQL databases, etc.).
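To make the "blend different data types together" point concrete, here is a minimal sketch in Python. The file names, fields, and values are made up for illustration, and a plain dict stands in for the lake's storage. The idea it tries to show is that raw files are kept as they arrived and a structure is only applied when they are read and correlated.

```python
import csv
import io
import json

# Hypothetical raw files, kept exactly as they arrived (no upfront schema).
# A dict stands in for the storage layer of the lake.
raw_files = {
    "sales/orders.csv": "user_id,amount\n1,20.5\n2,7.0\n1,4.5\n",
    "web/clicks.jsonl": (
        '{"user_id": 1, "page": "/home"}\n'
        '{"user_id": 2, "page": "/cart"}\n'
        '{"user_id": 1, "page": "/cart"}\n'
    ),
}


def read_orders(text):
    """Apply a schema to the CSV only at read time."""
    return [
        {"user_id": int(row["user_id"]), "amount": float(row["amount"])}
        for row in csv.DictReader(io.StringIO(text))
    ]


def read_clicks(text):
    """Parse one JSON document per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def spend_and_clicks_per_user(files):
    """Correlate two differently shaped sources on the shared user_id."""
    summary = {}
    for order in read_orders(files["sales/orders.csv"]):
        entry = summary.setdefault(order["user_id"], {"spend": 0.0, "clicks": 0})
        entry["spend"] += order["amount"]
    for click in read_clicks(files["web/clicks.jsonl"]):
        entry = summary.setdefault(click["user_id"], {"spend": 0.0, "clicks": 0})
        entry["clicks"] += 1
    return summary


print(spend_and_clicks_per_user(raw_files))
```

In a real lake the files would live in HDFS or an object store and the correlation would run as a distributed job, but the shape of the work is roughly the same.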

There's a whole bunch of data being created that we might get some value out of if we could analyze it. We can use the cloud to take that data, gather it in one store, and analyze it. In Azure, we have the Azure Data Lake Store, and we can store all of that data there. Azure Data Lake Store is like a cloud-based file service or file system that is pretty much unlimited in size.

We can run services on top of the data in that store. You could use Hadoop or Spark in an HDInsight cluster, or you could use the Azure Data Lake Analytics service, which complements the Azure Data Lake Store. That service lets you run jobs that query the data stored in the Azure Data Lake Store and generate output results.

Azure Data Lake Store is where we can store all the data that we want to analyze. Azure Data Lake Analytics is a service where we can run jobs that query that data to generate some sort of output for analysis. Hadoop is a specific technology (an open source distributed data processing cluster technology). You can implement a data lake using Hadoop or using a different tool.

Nedzad G answered Oct 06 '22 23:10


You've confused the concept (data lake) with a framework that can be used to implement it (Hadoop), but that's understandable because these terms are so closely associated with one another.

Hadoop is often associated with data lakes because some of the first data lakes were built using on-premises Hadoop. However, a data lake is just an architectural design pattern - data lakes can be built outside of Hadoop using any kind of scalable object storage (like Azure Data Lake or AWS S3 for example).
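One way to picture "pattern versus implementation" is a small sketch in Python. Everything here is hypothetical and for illustration only: `ObjectStore`, `InMemoryStore`, `LocalDirStore`, and `ingest` are names I made up, not part of Hadoop, Azure, or AWS. The ingest logic is written once against an abstract store, and the backend underneath can be swapped (in practice it might be HDFS, S3, or Azure Data Lake).

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class ObjectStore(ABC):
    """Hypothetical minimal storage interface a data lake could sit on."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        ...

    @abstractmethod
    def get(self, key: str) -> bytes:
        ...

    @abstractmethod
    def keys(self, prefix: str = "") -> list[str]:
        ...


class InMemoryStore(ObjectStore):
    """Toy backend that keeps objects in a dict."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def keys(self, prefix=""):
        return sorted(k for k in self._objects if k.startswith(prefix))


class LocalDirStore(ObjectStore):
    """Toy backend that keeps objects as files under a root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def put(self, key, data):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key):
        return (self.root / key).read_bytes()

    def keys(self, prefix=""):
        names = (
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )
        return sorted(n for n in names if n.startswith(prefix))


def ingest(store, source, name, payload):
    """Land raw bytes under raw/<source>/<name>, whatever the backend is."""
    key = f"raw/{source}/{name}"
    store.put(key, payload)
    return key


def demo(store):
    """Load two differently shaped files and list what landed in the lake."""
    ingest(store, "web", "clicks.jsonl", b'{"user": 1}\n')
    ingest(store, "crm", "customers.csv", b"id,name\n1,Ann\n")
    return store.keys("raw/")


if __name__ == "__main__":
    print(demo(InMemoryStore()))
    with tempfile.TemporaryDirectory() as tmp:
        print(demo(LocalDirStore(tmp)))
```

Both backends end up holding the same keys, which is the point: the "lake" is the layout and the way of working with raw data, not the particular storage engine.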

This site does a pretty good job of giving an overview of data lakes, including a history of data lakes that discusses Hadoop alongside other implementations. Here's another article that addresses how these terms get tied up together as well.

Crash Override answered Oct 07 '22 01:10