
Pig: Control number of mappers

I can control the number of reducers by using the PARALLEL clause in statements that trigger reducers.

I want to control the number of mappers. The data source is already created, and I can not reduce the number of parts in the data source. Is it possible to control the number of maps spawned by my pig statements? Can I keep a lower and upper cap on the number of maps spawned? Is it a good idea to control this?

I tried using pig.maxCombinedSplitSize, mapred.min.split.size, mapred.tasktracker.map.tasks.maximum, etc., but they did not seem to help.

Can someone please help me understand how to control the number of maps and possibly share a working example?

asked Jun 16 '14 by Gaurav Phapale

1 Answer

There is a simple rule of thumb for the number of mappers: there are as many mappers as there are file splits. A file split normally corresponds to an HDFS block (64 MB, 128 MB, or 256 MB, depending on your configuration); for example, a 1 GB file with a 128 MB block size yields 8 splits and therefore 8 map tasks. Note that FileInputFormat implementations take the block size into account, but can define their own split behaviour.

Splits are important because they are tied to the physical location of the data in the cluster: Hadoop brings the code to the data, not the data to the code.

The problem arises when the files are smaller than the block size (64 MB, 128 MB, 256 MB): there will be as many splits as there are input files, which is inefficient, as each map task incurs startup overhead. In this case your best bet is to use pig.maxCombinedSplitSize, which makes Pig read multiple small files into one mapper, effectively ignoring split boundaries. But if you make the value too large, you bring the data to the code and risk network issues: with too few mappers, data has to be streamed from other data nodes. Keep the value close to the block size, or half of it, and you should be fine.
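As a sketch of how that property is set in a Pig script (the 134217728 value, i.e. 128 MB, and the input/output paths are example assumptions; match the size to your HDFS block size or half of it):

```pig
-- Combine small input files into splits of up to ~128 MB (134217728 bytes)
-- so fewer mappers are spawned.
SET pig.maxCombinedSplitSize 134217728;
-- Split combination is on by default; shown here only for clarity.
SET pig.splitCombination true;

-- '/input/many-small-files' and '/output/count' are hypothetical paths.
data    = LOAD '/input/many-small-files' USING PigStorage(',');
grouped = GROUP data ALL;
counts  = FOREACH grouped GENERATE COUNT(data);
STORE counts INTO '/output/count';
```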

Another solution is to merge the small files into one large splittable file; that will automatically yield an efficient number of mappers.
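One way to do that merge in Pig itself is a pass-through job (paths here are hypothetical): ORDER forces a reduce phase, and since PARALLEL controls the number of reducers, it also controls the number of output part files:

```pig
-- Pass-through job that rewrites many small files as one large file.
-- ORDER forces a reduce phase; PARALLEL 1 yields a single output part
-- file (use a higher PARALLEL value for several larger files).
small  = LOAD '/input/many-small-files' USING PigStorage(',');
merged = ORDER small BY $0 PARALLEL 1;
STORE merged INTO '/input/merged' USING PigStorage(',');
```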

answered Nov 02 '22 by alexeipab