I have the following hive query:
select count(distinct id) as total from mytable;
which automatically spawns:
1408 Mappers
1 Reducer
I need to manually set the number of reducers and I have tried the following:
set mapred.reduce.tasks=50
set hive.exec.reducers.max=50
but none of these settings seem to be honored. The query takes forever to run. Is there a way to manually set the reducers or maybe rewrite the query so it can result in more reducers? Thanks!
The number of reducers can be set in two ways:

From the command line: when running the MapReduce job, the number of reducers can be specified with the property mapred.reduce.tasks.

In the driver program: call setNumReduceTasks with the desired value on the job object, e.g. job.setNumReduceTasks(5); alternatively, set the mapred.reduce.tasks property in the job configuration.

Yes, the number of reducers can be set to zero. The job is then map-only: the output is not sorted and is written directly to HDFS. If you want the mapper output to be sorted, you can use an Identity reducer.
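For the command-line route on a plain MapReduce job, the same property can be passed with -D, provided the driver goes through ToolRunner/GenericOptionsParser. A minimal sketch (the jar, class and paths below are placeholders, not from the question):

$ hadoop jar myjob.jar com.example.MyDriver -D mapred.reduce.tasks=50 /input /output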
Writing a query in Hive like this:
SELECT COUNT(DISTINCT id) ....
will always use only one reducer, because the final distinct aggregation has to be performed by a single reducer. You should:
use this command to set the desired number of reducers:
set mapred.reduce.tasks=50
rewrite the query as follows:
SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM ... ) t;
This results in 2 map+reduce jobs instead of one, but the performance gain will be substantial.
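Putting the two steps together, a sketch of the full Hive session against the table from the question (50 is just the target from the question; tune it to your data):

set mapred.reduce.tasks=50;
SELECT COUNT(*) FROM (
  SELECT DISTINCT id FROM mytable
) t;

The inner SELECT DISTINCT is spread across the 50 reducers; the final COUNT(*) still runs on a single reducer, but it only has to count rows that are already de-duplicated, so it is cheap.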
The number of reducers also depends on the size of the input data.
By default Hive allocates one reducer per 1 GB (1,000,000,000 bytes) of input. You can change that by setting the property hive.exec.reducers.bytes.per.reducer:
either by changing hive-site.xml
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>1000000</value>
</property>
or using set
$ hive -e "set hive.exec.reducers.bytes.per.reducer=1000000"