 

Hive performance

Tags: hadoop, hive

I work with Hive and I am new to it. I am facing some performance issues with my Hive queries.

  1. The number of mappers allocated to my job is very low even though hundreds of mappers are available. I have tried setting mapred.map.tasks=200, but the job still uses only 20 to 30 mappers. I understand that the number of mappers depends on the input splits. Is there any other option to increase the number of mappers? If not, why was the parameter (mapred.map.tasks) introduced?

  2. Is there any resource that explains how Hive queries map to MapReduce jobs, i.e. where the different parts of a query are executed?

asked Dec 11 '12 18:12 by bcarthic




2 Answers

For more information about setting the number of map tasks, check this link: http://wiki.apache.org/hadoop/HowManyMapsAndReduces. Basically, mapred.map.tasks is just a hint to the InputFormat; the actual number of map tasks is determined by the number of input splits, so the setting usually doesn't control anything.

To see how Hive queries are executed, simply prefix your query with EXPLAIN. For example: EXPLAIN SELECT foo FROM bar;. If you need even more information, there is also EXPLAIN EXTENDED.
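
As a rough sketch of how this looks in practice (bar and foo are the placeholder names from the example above, and the GROUP BY is an assumption added only so that the plan contains a reduce phase):

    -- Show the plan: the stage dependencies, then the plan of each stage.
    EXPLAIN
    SELECT foo, COUNT(*) AS cnt
    FROM bar
    GROUP BY foo;

    -- Same plan with extra detail, such as input paths and table properties.
    EXPLAIN EXTENDED
    SELECT foo, COUNT(*) AS cnt
    FROM bar
    GROUP BY foo;

With the MapReduce engine, each stage in the output typically lists a map operator tree and a reduce operator tree, which shows which parts of the query (scan, filter, group by, join) run on the map side and which on the reduce side.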

answered Nov 02 '22 15:11 by Joe K


I see this question was asked a long time ago. I'll try to answer it anyway, even though some of the suggestions here were not available when the question was asked.

To optimize Hive performance:

  • Tune the number of mappers and reducers used by your Hive query. This can be done by tuning the input size for each mapper (mapreduce.input.fileinputformat.split.maxsize) and the input size for each reducer (hive.exec.reducers.bytes.per.reducer).

Bear in mind that "the more the better" is not always true, so you need to tune those numbers to your needs.
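
A minimal sketch of such tuning, assuming splittable input files and Hive's default combining input format; the byte values are illustrative assumptions, not recommendations. The number of mappers is roughly the total input size divided by the split size, so about 6 GB of input with 256 MB splits gives around 24 mappers, while 32 MB splits give around 192.

    -- Upper bound on the bytes handed to one mapper (32 MB here).
    -- A smaller value means more splits and therefore more mappers.
    SET mapreduce.input.fileinputformat.split.maxsize=33554432;

    -- Target bytes of input per reducer (128 MB here).
    -- A smaller value means more reducers.
    SET hive.exec.reducers.bytes.per.reducer=134217728;

On older Hadoop versions the first property is named mapred.max.split.size. The actual counts also depend on the file layout and the input format, so it is worth checking the job counters after each change.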

  • Optimize the joins: where possible, convert joins to map joins if one of the tables is small (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization).

  • Partition your table on columns that are often used in conditions (WHERE).
    For example, if you frequently run
    SELECT * FROM myTable WHERE someColumn = 'someValue'
    it is recommended to partition your table on the column someColumn.
    This lets your query read only the files of the partition someColumn=someValue instead of all of the table's files.
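
A minimal sketch of that layout, reusing the placeholder names myTable and someColumn from the example; the other columns are hypothetical.

    -- The partition column is declared in PARTITIONED BY, not in the column list.
    CREATE TABLE myTable (
      id BIGINT,
      payload STRING
    )
    PARTITIONED BY (someColumn STRING);

    -- Only the files under the someColumn=someValue directory are read.
    SELECT * FROM myTable WHERE someColumn = 'someValue';

Partitioning on a column with very many distinct values creates many small directories and files, so this works best for columns with a moderate number of values.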

  • Compress the intermediate results; this may improve performance in some cases (depending on your hardware configuration, network, CPU and memory). Compression is enabled with hive.exec.compress.intermediate, and the codec is chosen with the property hive.intermediate.compression.codec.

  • Choose the right compression codec for the output, for example Snappy:

    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    SET mapred.output.compression.type=BLOCK;
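
The join and intermediate-compression suggestions can be set in the same way. A sketch, assuming the MapReduce engine; the size threshold and the table names big_table and small_table are hypothetical:

    -- Let Hive convert a join to a map join automatically when one side is small.
    SET hive.auto.convert.join=true;
    -- Size in bytes below which a table counts as small (about 25 MB here).
    SET hive.mapjoin.smalltable.filesize=25000000;

    -- Alternatively, request a map join explicitly with a hint.
    SELECT /*+ MAPJOIN(s) */ b.id, s.name
    FROM big_table b
    JOIN small_table s ON b.id = s.id;

    -- Compress the data passed between the stages of a multi-stage query.
    SET hive.exec.compress.intermediate=true;
    SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

In newer Hive versions the MAPJOIN hint is ignored by default (see hive.ignore.mapjoin.hint), so the automatic conversion is usually the one to rely on.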
    

Not available at the time of the question:

  • Use an optimized file format to store your table: instead of Text File or Sequence File, you could use ORC (Hive 0.11+), for example (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC).

  • Use another engine to execute your queries: instead of MapReduce, you could use Tez or even Spark. To use Tez, for example:

    <property>
        <name>hive.execution.engine</name>
        <value>tez</value>
    </property>
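
The ORC and Tez suggestions can also be tried from a Hive session without touching the configuration files. A sketch, assuming Tez is installed on the cluster; the table names are hypothetical:

    -- Switch the execution engine for the current session only.
    SET hive.execution.engine=tez;

    -- Create an ORC copy of an existing text table.
    CREATE TABLE my_table_orc STORED AS ORC
    AS SELECT * FROM my_table_text;

Running the same query against both tables is a simple way to see whether the format change helps for your data.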
    

For further optimization you could refer here

answered Nov 02 '22 16:11 by user1314742