I work on Hive and I am new to it. I am facing some performance issues with a Hive query.
The number of mappers allocated to my job is very low even though hundreds of mappers are available. I have tried setting mapred.map.tasks=200, but it still uses only 20 to 30 mappers. I understand that the number of mappers depends on the input splits. Is there any other option to increase the number of mappers? If not, then why was the parameter mapred.map.tasks introduced?
Is there any resource where I can learn how Hive queries correspond to MapReduce jobs, i.e. where the different parts of a query are executed?
Without partitioning, Hive reads all the data in the directory and applies the query filters to it. This is slow and expensive since all data has to be read.
For more information about setting map tasks, check this link: http://wiki.apache.org/hadoop/HowManyMapsAndReduces. Basically, mapred.map.tasks is just a hint; in practice it doesn't really control anything.
To see how Hive queries are executed, simply preface your query with explain, for example: explain select foo from bar; If you need even more information, there's also explain extended.
I see this question was asked a long time ago; I'll try to answer it anyway, even though some of the suggestions here were not available at the time the question was asked.
To optimize Hive performance:
Tune the number of mappers and reducers used by your job by setting the maximum split size, mapreduce.input.fileinputformat.split.maxsize, and the input size for each reducer, hive.exec.reducers.bytes.per.reducer. Bear in mind that "the more the better" is not always true, so you need to tune those numbers to your needs.
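A minimal sketch of setting these per session; the byte values below are only placeholders and need to be tuned for your data and cluster:
-- smaller max split size => more splits => more mappers (value in bytes, placeholder)
SET mapreduce.input.fileinputformat.split.maxsize=268435456;
-- target input size per reducer (value in bytes, placeholder)
SET hive.exec.reducers.bytes.per.reducer=1073741824;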
Optimize the joins; convert joins to map-joins if one of the tables is small (if possible)... (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization)
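As a hedged sketch, map-join conversion can either be left to Hive or forced with a hint; the table names small_dim and big_fact below are hypothetical:
-- let Hive convert joins automatically when the smaller table is below the threshold
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- threshold in bytes (placeholder)
-- or force a map-join explicitly with a hint
SELECT /*+ MAPJOIN(small_dim) */ f.id, d.name
FROM big_fact f
JOIN small_dim d ON f.dim_id = d.id;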
Partition your table on columns that are often used in conditions (WHERE).
For example, if you frequently run SELECT * FROM myTable WHERE someColumn = 'someValue', it is recommended to partition the table on the column someColumn. This lets your query scan just the partition directory someColumn=someValue instead of all of the table's files.
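A minimal sketch of such a partitioned table, reusing the names from the example above (the other columns are hypothetical):
-- partition on the column used in the WHERE clause
CREATE TABLE myTable (
  id   INT,      -- hypothetical column
  name STRING    -- hypothetical column
)
PARTITIONED BY (someColumn STRING);
-- this query now only scans the directory someColumn=someValue
SELECT * FROM myTable WHERE someColumn = 'someValue';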
Compressing the intermediate results may improve performance in some cases (depending on your hardware configuration, network and CPU/memory). This can be done by setting the property hive.intermediate.compression.codec.
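A minimal sketch of enabling intermediate compression, assuming Snappy is available on the cluster:
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;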
Choose the right compression codec for the output as well, for example Snappy, as in:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
Suggestions that were not available at the time the question was asked:
Use an optimized file format to store your table: instead of Text File or Sequence File, you could use ORC (Hive 0.11+), for example (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC).
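As an illustrative sketch (the table names are hypothetical), an existing text table can be rewritten into ORC with a CTAS, or a new table can be declared as ORC from the start:
-- copy an existing table into ORC format
CREATE TABLE some_orc_table STORED AS ORC AS
SELECT * FROM some_text_table;
-- or declare a new table as ORC directly
CREATE TABLE events_orc (id INT, payload STRING) STORED AS ORC;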
Use another engine to execute your queries: instead of MapReduce, you could use Tez or even Spark. To use Tez, for example:
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
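Alternatively (assuming Tez is already installed on the cluster), the engine can also be switched for a single session:
SET hive.execution.engine=tez;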
For further optimization, you could refer here.