I work on Hive and I am new to it. I am facing some performance issues with a Hive query.
The number of mappers allocated to my job is very low even though hundreds of mappers are available. I have tried setting mapred.map.tasks=200, but it still uses only 20 to 30 mappers. I understand that the number of mappers depends on the input splits. Is there any other option to increase the number of mappers? If not, then why was the parameter mapred.map.tasks introduced?
Is there any resource where I can learn how Hive queries correspond to MapReduce jobs, i.e. where the different parts of a query are executed?
Without partitioning, Hive reads all the data in the directory and applies the query filters to it. This is slow and expensive since all data has to be read.
For more information about setting map tasks, check this link: http://wiki.apache.org/hadoop/HowManyMapsAndReduces. Basically, mapred.map.tasks is just a hint; in practice it doesn't really control anything.
To see how Hive queries are executed, simply preface your query with explain, for example: explain select foo from bar; If you need even more information, there's also explain extended.
I see this question was asked a long time ago; I'll try to answer it anyway, even though some of the suggestions here were not available at the time the question was asked.
To optimize Hive performance:
Tune the number of mappers and reducers used by your job by setting the maximum split size, mapreduce.input.fileinputformat.split.maxsize, and the input size for each reducer, hive.exec.reducers.bytes.per.reducer. Bear in mind that "the more the better" is not always true, so you need to tune those numbers to your needs.
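A minimal sketch of setting these per session; the byte values below are only placeholders and need to be tuned for your data and cluster:
-- smaller max split size => more splits => more mappers (value in bytes, placeholder)
SET mapreduce.input.fileinputformat.split.maxsize=268435456;
-- target input size per reducer (value in bytes, placeholder)
SET hive.exec.reducers.bytes.per.reducer=1073741824;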
Optimize the joins; convert joins to map-joins if one of the tables is small (if possible)... (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization)
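As a hedged sketch, map-join conversion can either be left to Hive or forced with a hint; the table names small_dim and big_fact below are hypothetical:
-- let Hive convert joins automatically when the smaller table is below the threshold
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- threshold in bytes (placeholder)
-- or force a map-join explicitly with a hint
SELECT /*+ MAPJOIN(small_dim) */ f.id, d.name
FROM big_fact f
JOIN small_dim d ON f.dim_id = d.id;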
Partition your table on columns that are often used in conditions (WHERE).
For example, if you frequently run SELECT * FROM myTable WHERE someColumn = 'someValue', it is recommended to partition the table on the column someColumn. This lets your query scan just the partition directory someColumn=someValue instead of all of the table's files.
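A minimal sketch of such a partitioned table, reusing the names from the example above (the other columns are hypothetical):
-- partition on the column used in the WHERE clause
CREATE TABLE myTable (
  id   INT,      -- hypothetical column
  name STRING    -- hypothetical column
)
PARTITIONED BY (someColumn STRING);
-- this query now only scans the directory someColumn=someValue
SELECT * FROM myTable WHERE someColumn = 'someValue';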
Compressing the intermediate results may improve performance in some cases (depending on your hardware configuration, network and CPU/memory). This can be done by setting the property hive.intermediate.compression.codec.
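A minimal sketch of enabling intermediate compression, assuming Snappy is available on the cluster:
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;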
Choose the right compression codec for the output as well, for example Snappy, as in:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
Suggestions that were not available at the time the question was asked:
Use an optimized file format to store your table: instead of Text File or Sequence File, you could use ORC (Hive 0.11+), for example (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC).
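As an illustrative sketch (the table names are hypothetical), an existing text table can be rewritten into ORC with a CTAS, or a new table can be declared as ORC from the start:
-- copy an existing table into ORC format
CREATE TABLE some_orc_table STORED AS ORC AS
SELECT * FROM some_text_table;
-- or declare a new table as ORC directly
CREATE TABLE events_orc (id INT, payload STRING) STORED AS ORC;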
Use another engine to execute your queries: instead of MapReduce, you could use Tez or even Spark. To use Tez, for example:
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
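Alternatively (assuming Tez is already installed on the cluster), the engine can also be switched for a single session:
SET hive.execution.engine=tez;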
For further optimization, you could refer here.