 

Download large data for Hadoop [closed]

I need a large dataset (more than 10 GB) to run a Hadoop demo. Does anybody know where I can download one? Please let me know.

asked Jun 01 '12 by Nevis


People also ask

How does Hadoop store big data?

Capacity: Hadoop stores large volumes of data. By using a distributed file system called HDFS (Hadoop Distributed File System), the data is split into chunks and saved across clusters of commodity servers.
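To make that concrete, here is a minimal sketch (not part of the original page) that uses the Hadoop Java FileSystem API to print where HDFS placed each block of a file. The class name and the path /demo/sample.dat are hypothetical, and it assumes a reachable cluster configured via the default fs.defaultFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/sample.dat"); // hypothetical file already in HDFS
        FileStatus status = fs.getFileStatus(file);
        // Each BlockLocation is one chunk of the file, replicated across DataNodes.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```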

How much data can Hadoop handle?

HDFS can easily store terabytes of data using any number of inexpensive commodity servers. It does so by breaking each large file into blocks (the default block size is 64 MB; however, the most commonly used block size today is 128 MB).
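If you want to check or override the block size from client code, here is a short sketch (the property key dfs.blocksize is the standard one on Hadoop 2.x and later; older releases used dfs.block.size):

```java
import org.apache.hadoop.conf.Configuration;

public class BlockSizeDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Effective block size, falling back to 128 MB (the value most clusters use today).
        long current = conf.getLong("dfs.blocksize", 128L * 1024 * 1024);
        System.out.println("dfs.blocksize = " + current);
        // Files created through this client's config would now use 256 MB blocks.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
    }
}
```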

Is Hadoop still relevant?

Or, is it dead altogether? In reality, Apache Hadoop is not dead, and many organizations are still using it as a robust data analytics solution. One key indicator is that all major cloud providers are actively supporting Apache Hadoop clusters in their respective platforms.

Is Spark better than Hadoop?

Like Hadoop, Spark splits up large tasks across different nodes. However, it tends to perform faster than Hadoop and it uses random access memory (RAM) to cache and process data instead of a file system. This enables Spark to handle use cases that Hadoop cannot.
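As an illustration (not from the original page), caching in Spark is a one-liner; the input path below is hypothetical:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CacheDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile("hdfs:///demo/sample.dat"); // hypothetical path
        lines.cache(); // keep partitions in RAM so later actions skip re-reading HDFS
        long total = lines.count();                              // first action fills the cache
        long nonEmpty = lines.filter(s -> !s.isEmpty()).count(); // served from memory
        System.out.printf("total=%d nonEmpty=%d%n", total, nonEmpty);
        sc.stop();
    }
}
```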


2 Answers

I would suggest downloading the Million Song Dataset from the following website:

http://labrosa.ee.columbia.edu/millionsong/

The best thing about the Million Song Dataset is that you can download a 1 GB (about 10,000 songs), 10 GB, 50 GB, or roughly 300 GB subset to your Hadoop cluster and run whatever tests you want. I love using it and have learned a lot with this data set.

To start, you can download the subset for any one letter from A-Z, which ranges from 1 GB to 20 GB. You can also use the Infochimps site:

http://www.infochimps.com/collections/million-songs
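If you prefer scripting the download, here is a rough Java sketch; the archive URL is a placeholder, so substitute the real subset link from the dataset's download page:

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class FetchSubset {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: copy the actual subset link from the dataset page.
        URL url = new URL("http://labrosa.ee.columbia.edu/millionsong/PLACEHOLDER.tar.gz");
        try (InputStream in = url.openStream()) {
            Files.copy(in, Paths.get("subset.tar.gz"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```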

In one of my blog posts, I showed how to download the 1 GB dataset and run Pig scripts on it:

http://blogs.msdn.com/b/avkashchauhan/archive/2012/04/12/processing-million-songs-dataset-with-pig-scripts-on-apache-hadoop-on-windows-azure.aspx

answered Sep 30 '22 by AvkashChauhan


Tom White mentions a sample weather dataset in his book, Hadoop: The Definitive Guide.

http://hadoopbook.com/code.html

Data is available for more than 100 years.

I used wget on Linux to pull the data. For the year 2007 alone, the data size is 27 GB.

It is hosted on an FTP server, so you can download it with any FTP utility.

ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
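Once the files are downloaded (with wget or any FTP client), they still need to be copied into HDFS before the book's MapReduce examples can read them. A minimal sketch, assuming default cluster configuration and illustrative paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadWeather {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Copy the locally downloaded year directory into HDFS (paths are illustrative).
        fs.copyFromLocalFile(new Path("/data/noaa/2007"), new Path("/user/hadoop/noaa/2007"));
        fs.close();
    }
}
```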

For complete details, please check my blog:

http://myjourneythroughhadoop.blogspot.in/2013/07/how-to-download-weather-data-for-your.html

answered Sep 30 '22 by Jagadish Talluri