I installed Hadoop 1.0.0 and tried out the word count example (single-node cluster). It took 2 minutes 48 seconds to complete. Then I ran the standard Linux word count program, which finished in 10 milliseconds on the same data set (180 kB). Am I doing something wrong, or is Hadoop just very, very slow?
time hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount someinput someoutput
12/01/29 23:04:41 INFO input.FileInputFormat: Total input paths to process : 30
12/01/29 23:04:41 INFO mapred.JobClient: Running job: job_201201292302_0001
12/01/29 23:04:42 INFO mapred.JobClient: map 0% reduce 0%
12/01/29 23:05:05 INFO mapred.JobClient: map 6% reduce 0%
12/01/29 23:05:15 INFO mapred.JobClient: map 13% reduce 0%
12/01/29 23:05:25 INFO mapred.JobClient: map 16% reduce 0%
12/01/29 23:05:27 INFO mapred.JobClient: map 20% reduce 0%
12/01/29 23:05:28 INFO mapred.JobClient: map 20% reduce 4%
12/01/29 23:05:34 INFO mapred.JobClient: map 20% reduce 5%
12/01/29 23:05:35 INFO mapred.JobClient: map 23% reduce 5%
12/01/29 23:05:36 INFO mapred.JobClient: map 26% reduce 5%
12/01/29 23:05:41 INFO mapred.JobClient: map 26% reduce 8%
12/01/29 23:05:44 INFO mapred.JobClient: map 33% reduce 8%
12/01/29 23:05:53 INFO mapred.JobClient: map 36% reduce 11%
12/01/29 23:05:54 INFO mapred.JobClient: map 40% reduce 11%
12/01/29 23:05:56 INFO mapred.JobClient: map 40% reduce 12%
12/01/29 23:06:01 INFO mapred.JobClient: map 43% reduce 12%
12/01/29 23:06:02 INFO mapred.JobClient: map 46% reduce 12%
12/01/29 23:06:06 INFO mapred.JobClient: map 46% reduce 14%
12/01/29 23:06:09 INFO mapred.JobClient: map 46% reduce 15%
12/01/29 23:06:11 INFO mapred.JobClient: map 50% reduce 15%
12/01/29 23:06:12 INFO mapred.JobClient: map 53% reduce 15%
12/01/29 23:06:20 INFO mapred.JobClient: map 56% reduce 15%
12/01/29 23:06:21 INFO mapred.JobClient: map 60% reduce 17%
12/01/29 23:06:28 INFO mapred.JobClient: map 63% reduce 17%
12/01/29 23:06:29 INFO mapred.JobClient: map 66% reduce 17%
12/01/29 23:06:30 INFO mapred.JobClient: map 66% reduce 20%
12/01/29 23:06:36 INFO mapred.JobClient: map 70% reduce 22%
12/01/29 23:06:37 INFO mapred.JobClient: map 73% reduce 22%
12/01/29 23:06:45 INFO mapred.JobClient: map 80% reduce 24%
12/01/29 23:06:51 INFO mapred.JobClient: map 80% reduce 25%
12/01/29 23:06:54 INFO mapred.JobClient: map 86% reduce 25%
12/01/29 23:06:55 INFO mapred.JobClient: map 86% reduce 26%
12/01/29 23:07:02 INFO mapred.JobClient: map 90% reduce 26%
12/01/29 23:07:03 INFO mapred.JobClient: map 93% reduce 26%
12/01/29 23:07:07 INFO mapred.JobClient: map 93% reduce 30%
12/01/29 23:07:09 INFO mapred.JobClient: map 96% reduce 30%
12/01/29 23:07:10 INFO mapred.JobClient: map 96% reduce 31%
12/01/29 23:07:12 INFO mapred.JobClient: map 100% reduce 31%
12/01/29 23:07:22 INFO mapred.JobClient: map 100% reduce 100%
12/01/29 23:07:28 INFO mapred.JobClient: Job complete: job_201201292302_0001
12/01/29 23:07:28 INFO mapred.JobClient: Counters: 29
12/01/29 23:07:28 INFO mapred.JobClient: Job Counters
12/01/29 23:07:28 INFO mapred.JobClient: Launched reduce tasks=1
12/01/29 23:07:28 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=275346
12/01/29 23:07:28 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/29 23:07:28 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/01/29 23:07:28 INFO mapred.JobClient: Launched map tasks=30
12/01/29 23:07:28 INFO mapred.JobClient: Data-local map tasks=30
12/01/29 23:07:28 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=137186
12/01/29 23:07:28 INFO mapred.JobClient: File Output Format Counters
12/01/29 23:07:28 INFO mapred.JobClient: Bytes Written=26287
12/01/29 23:07:28 INFO mapred.JobClient: FileSystemCounters
12/01/29 23:07:28 INFO mapred.JobClient: FILE_BYTES_READ=71510
12/01/29 23:07:28 INFO mapred.JobClient: HDFS_BYTES_READ=89916
12/01/29 23:07:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=956282
12/01/29 23:07:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=26287
12/01/29 23:07:28 INFO mapred.JobClient: File Input Format Counters
12/01/29 23:07:28 INFO mapred.JobClient: Bytes Read=85860
12/01/29 23:07:28 INFO mapred.JobClient: Map-Reduce Framework
12/01/29 23:07:28 INFO mapred.JobClient: Map output materialized bytes=71684
12/01/29 23:07:28 INFO mapred.JobClient: Map input records=2574
12/01/29 23:07:28 INFO mapred.JobClient: Reduce shuffle bytes=71684
12/01/29 23:07:28 INFO mapred.JobClient: Spilled Records=6696
12/01/29 23:07:28 INFO mapred.JobClient: Map output bytes=118288
12/01/29 23:07:28 INFO mapred.JobClient: CPU time spent (ms)=39330
12/01/29 23:07:28 INFO mapred.JobClient: Total committed heap usage (bytes)=5029167104
12/01/29 23:07:28 INFO mapred.JobClient: Combine input records=8233
12/01/29 23:07:28 INFO mapred.JobClient: SPLIT_RAW_BYTES=4056
12/01/29 23:07:28 INFO mapred.JobClient: Reduce input records=3348
12/01/29 23:07:28 INFO mapred.JobClient: Reduce input groups=1265
12/01/29 23:07:28 INFO mapred.JobClient: Combine output records=3348
12/01/29 23:07:28 INFO mapred.JobClient: Physical memory (bytes) snapshot=4936278016
12/01/29 23:07:28 INFO mapred.JobClient: Reduce output records=1265
12/01/29 23:07:28 INFO mapred.JobClient: Virtual memory (bytes) snapshot=26102546432
12/01/29 23:07:28 INFO mapred.JobClient: Map output records=8233
real 2m48.886s
user 0m3.300s
sys 0m0.304s
time wc someinput/*
178 1001 8674 someinput/capacity-scheduler.xml
178 1001 8674 someinput/capacity-scheduler.xml.bak
7 7 196 someinput/commons-logging.properties
7 7 196 someinput/commons-logging.properties.bak
24 35 535 someinput/configuration.xsl
80 122 1968 someinput/core-site.xml
80 122 1972 someinput/core-site.xml.bak
1 0 1 someinput/dfs.exclude
1 0 1 someinput/dfs.include
12 36 327 someinput/fair-scheduler.xml
45 192 2141 someinput/hadoop-env.sh
45 192 2139 someinput/hadoop-env.sh.bak
20 137 910 someinput/hadoop-metrics2.properties
20 137 910 someinput/hadoop-metrics2.properties.bak
118 582 4653 someinput/hadoop-policy.xml
118 582 4653 someinput/hadoop-policy.xml.bak
241 623 6616 someinput/hdfs-site.xml
241 623 6630 someinput/hdfs-site.xml.bak
171 417 6177 someinput/log4j.properties
171 417 6177 someinput/log4j.properties.bak
1 0 1 someinput/mapred.exclude
1 0 1 someinput/mapred.include
12 15 298 someinput/mapred-queue-acls.xml
12 15 298 someinput/mapred-queue-acls.xml.bak
338 897 9616 someinput/mapred-site.xml
338 897 9630 someinput/mapred-site.xml.bak
1 1 10 someinput/masters
1 1 18 someinput/slaves
57 89 1243 someinput/ssl-client.xml.example
55 85 1195 someinput/ssl-server.xml.example
2574 8233 85860 total
real 0m0.009s
user 0m0.004s
sys 0m0.000s
Hadoop performance tuning helps you get the most out of your cluster. The process is iterative: Run Job -> Identify Bottleneck -> Address Bottleneck, repeated until the job performs acceptably.
Hadoop exposes many tuning options for CPU, memory, disk, and network. Most Hadoop tasks are not CPU-bound, so the usual priorities are memory usage and disk spills. Below are a few MapReduce properties that can be set at various phases of a job to improve performance. There is no one-size-fits-all technique for tuning Hadoop jobs; because of Hadoop's architecture, achieving balance among resources is often more effective than attacking a single problem.
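For illustration, here is one way a few common Hadoop 1.x properties could be overridden per job through the -D generic options (the bundled WordCount parses them with GenericOptionsParser; the values shown are illustrative starting points, not universal recommendations):

hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount \
    -D io.sort.mb=200 \
    -D io.sort.spill.percent=0.90 \
    -D mapred.compress.map.output=true \
    someinput someoutput

Here io.sort.mb enlarges the map-side sort buffer, io.sort.spill.percent raises the threshold at which it spills to disk, and mapred.compress.map.output shrinks shuffle traffic; all three target the memory and disk-spill costs mentioned above.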
Hadoop is a framework that can store and process massive amounts of data across clusters of computers. Being open source and Java-based, it is compatible with practically all platforms.
This depends on a large number of factors: your configuration, your machine, the memory configuration, JVM settings, and so on. You also need to subtract JVM startup time from your measurement.
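On the JVM-startup point, one Hadoop 1.x knob worth knowing is task JVM reuse, which amortizes startup cost when a job launches many small tasks (-1 means reuse without limit; as above, this is a sketch of a per-job override, not a blanket recommendation):

hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount \
    -D mapred.job.reuse.jvm.num.tasks=-1 \
    someinput someoutput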
It runs much more quickly for me. That said, of course it will be slower on small data sets than a dedicated C program; consider what it's doing "behind the scenes".
Try it on a terabyte of data spread across a few thousand files and see what happens.
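If you want a (much smaller-scale) version of that experiment, you could blow the same input up into one large file and time the job again; the names and repetition count below are just illustrative:

for i in $(seq 1 10000); do cat someinput/*; done > bigfile.txt
hadoop fs -mkdir biginput
hadoop fs -put bigfile.txt biginput/
time hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount biginput bigoutput

That yields roughly 850 MB, still nowhere near a terabyte, but enough to span multiple HDFS blocks and make the fixed per-job overhead a small fraction of the total.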
As Dave said, Hadoop is optimized for handling large amounts of data, not toy examples. There's a tax for "waking up the elephant" to get things going, and you don't pay it when you work on smaller sets with local tools. Note the counters above: your job launched 30 map tasks for roughly 85 kB of input, so task scheduling and JVM startup dominate the runtime. You can take a look at "About the performance of Map Reduce Jobs" for some details on what's going on.
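A related mitigation if you must run on this data set: concatenate the 30 small files into one before loading them into HDFS, so the job schedules a single map task instead of 30 (someinput_combined is a hypothetical directory name):

cat someinput/* > combined.txt
hadoop fs -mkdir someinput_combined
hadoop fs -put combined.txt someinput_combined/
time hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount someinput_combined someoutput2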