I need a large dataset (more than 10GB) to run a Hadoop demo. Does anybody know where I can download one? Please let me know.
1. Capacity: Hadoop stores large volumes of data. Using a distributed file system called HDFS (the Hadoop Distributed File System), data is split into blocks and stored across clusters of commodity servers.
HDFS can easily store terabytes of data using any number of inexpensive commodity servers. It does so by breaking each large file into blocks (the default block size was 64MB in early Hadoop releases; since Hadoop 2 the default is 128MB).
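To make the block arithmetic concrete, here is a minimal Python sketch (not Hadoop code, just the math) showing how many blocks a 10GB file occupies under the older 64MB and current 128MB defaults:

```python
import math

def hdfs_block_count(file_size_bytes: int, block_size_bytes: int = 128 * 1024 * 1024) -> int:
    """Return how many HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_bytes / block_size_bytes)

ten_gb = 10 * 1024**3
print(hdfs_block_count(ten_gb))                    # 80 blocks at the 128MB default
print(hdfs_block_count(ten_gb, 64 * 1024 * 1024))  # 160 blocks at the older 64MB default
```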
Or, is it dead altogether? In reality, Apache Hadoop is not dead, and many organizations are still using it as a robust data analytics solution. One key indicator is that all major cloud providers are actively supporting Apache Hadoop clusters in their respective platforms.
Like Hadoop, Spark splits large tasks across different nodes. However, it tends to perform faster than Hadoop because it uses random-access memory (RAM) to cache and process data instead of going back to the file system. This enables Spark to handle use cases that Hadoop cannot.
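To see the caching difference in practice, here is a minimal PySpark sketch (the HDFS path is a placeholder): the first action reads from the file system, and cache() keeps the data in executor RAM for every pass after that.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical HDFS path -- substitute your own dataset.
lines = spark.sparkContext.textFile("hdfs:///data/weather/2007")

# cache() keeps the RDD in executor RAM after the first action,
# so later passes avoid re-reading from HDFS entirely.
lines.cache()
print(lines.count())                               # first pass: reads from HDFS
print(lines.filter(lambda l: "TMP" in l).count())  # second pass: served from RAM

spark.stop()
```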
I would suggest downloading the Million Song Dataset from the following website:
http://labrosa.ee.columbia.edu/millionsong/
The best thing about the Million Song Dataset is that you can download a 1GB (about 10,000 songs), 10GB, 50GB, or roughly 300GB subset to your Hadoop cluster and run whatever tests you want. I love using it and have learned a lot from this dataset.
To start, you can download the subset for any one letter from A to Z, which will range from 1GB to 20GB. You can also use the Infochimps site (a rough download-and-load sketch follows the link):
http://www.infochimps.com/collections/million-songs
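As a sketch of the download-then-load workflow in Python (the subset URL below is a placeholder; take the real link from the dataset pages above):

```python
import subprocess
import urllib.request

# Hypothetical URL -- substitute the actual subset link from the
# Million Song Dataset page.
SUBSET_URL = "http://example.com/millionsong/subset_A.tar.gz"
LOCAL_FILE = "subset_A.tar.gz"

# Download the archive locally, then push it into HDFS for processing.
urllib.request.urlretrieve(SUBSET_URL, LOCAL_FILE)
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/millionsong"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", LOCAL_FILE, "/data/millionsong/"], check=True)
```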
In the following blog post I showed how to download the 1GB dataset and run Pig scripts against it:
http://blogs.msdn.com/b/avkashchauhan/archive/2012/04/12/processing-million-songs-dataset-with-pig-scripts-on-apache-hadoop-on-windows-azure.aspx
Tom White mentions a sample weather dataset in his book (Hadoop: The Definitive Guide).
http://hadoopbook.com/code.html
Data is available for more than 100 years.
I used wget on Linux to pull the data; for the year 2007 alone the data size is 27 GB. It is hosted as an FTP link, so you can download it with any FTP utility (a Python sketch follows the link below).
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
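If you prefer scripting the download instead of wget, here is a minimal Python sketch using the standard library's ftplib, assuming the per-year directory layout under /pub/data/noaa/ (the 2007 directory alone is about 27 GB, so expect a long run):

```python
from ftplib import FTP
import os

YEAR = "2007"

ftp = FTP("ftp.ncdc.noaa.gov")
ftp.login()                      # anonymous login
ftp.cwd(f"/pub/data/noaa/{YEAR}")

# Fetch every station file for the year into a local directory.
os.makedirs(YEAR, exist_ok=True)
for name in ftp.nlst():
    with open(os.path.join(YEAR, name), "wb") as out:
        ftp.retrbinary(f"RETR {name}", out.write)

ftp.quit()
```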
For complete details please check my blog:
http://myjourneythroughhadoop.blogspot.in/2013/07/how-to-download-weather-data-for-your.html