
Pig: Force one mapper per input line/row

I have a Pig Streaming job where the number of mappers should equal the number of rows/lines in the input file. I know that setting

set mapred.min.split.size 16 
set mapred.max.split.size 16
set pig.noSplitCombination true 

will ensure that each input split is 16 bytes. But how do I ensure that each map task gets exactly one line as input? The lines are variable length, so a constant value for mapred.min.split.size and mapred.max.split.size does not solve this.

Here is the code I intend to use:

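-- Load the input file; each line becomes one record.
lines = load 'hdfs://cluster/tmp/input';
DEFINE CMD `/usr/bin/python script.py`;
-- Stream every record through the external Python script.
OP = stream lines through CMD;
dump OP;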

SOLVED! Thanks to zsxwing

And, in case anyone else runs into this weird nonsense, know this:

To ensure that Pig creates one mapper for each input file you must set

set pig.splitCombination false

and not

set pig.noSplitCombination true

Why this is the case, I have no idea!

Asked by sergeyf, Jun 11 '13 22:06


1 Answer

Following your clue, I browsed the Pig source code to find the answer.

Setting pig.noSplitCombination in the Pig script doesn't work. In the Pig script, you need to use pig.splitCombination. Pig then sets pig.noSplitCombination in the JobConf according to the value of pig.splitCombination.

If you want to set pig.noSplitCombination directly, you need to pass it on the command line. For example:

pig -Dpig.noSplitCombination=true -f foo.pig

The difference between the two is where the value ends up: a set statement in the Pig script stores it in the Pig properties, while -D stores it in the Hadoop Configuration.

If you use set pig.noSplitCombination true, then (pig.noSplitCombination, true) is stored in the Pig properties. But when Pig initializes a JobConf, it looks up pig.splitCombination in the Pig properties, not pig.noSplitCombination, so your setting has no effect. This is visible in the Pig source code. The correct way is set pig.splitCombination false, as you mentioned.

If you use -Dpig.noSplitCombination=true, then (pig.noSplitCombination, true) is stored in the Hadoop Configuration. Since the JobConf is copied from the Configuration, the -D value is passed straight through to the JobConf.

Finally, PigInputFormat reads pig.noSplitCombination from the JobConf to decide whether to combine splits, as the Pig source code shows.
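
To make the lookup chain above easier to follow, here is a toy model of it in Python. It is only a sketch of the behaviour described in this answer, not Pig's actual code: the function names build_job_conf and splits_are_combined are made up for illustration, plain dicts stand in for the Pig properties, the Hadoop Configuration and the JobConf, and the default values are assumptions.

```python
# Toy model of the behaviour described in this answer; not Pig's actual code.
# All function names are hypothetical, and the defaults are assumptions.

def build_job_conf(pig_properties, hadoop_configuration):
    """Mimic how a JobConf is derived from the two settings stores."""
    # The JobConf starts as a copy of the Hadoop Configuration,
    # so values passed with -D carry over unchanged.
    job_conf = dict(hadoop_configuration)
    # Only pig.splitCombination is looked up in the Pig properties
    # (assumed default: "true"); pig.noSplitCombination is never read from there.
    if pig_properties.get("pig.splitCombination", "true") != "true":
        job_conf["pig.noSplitCombination"] = "true"
    return job_conf


def splits_are_combined(job_conf):
    """Mimic the final check: combine unless pig.noSplitCombination is true."""
    return job_conf.get("pig.noSplitCombination", "false") != "true"


# 'set pig.noSplitCombination true' in the script: stored, but never consulted.
ignored = build_job_conf({"pig.noSplitCombination": "true"}, {})
# 'set pig.splitCombination false' in the script: translated into the JobConf key.
via_set = build_job_conf({"pig.splitCombination": "false"}, {})
# 'pig -Dpig.noSplitCombination=true': copied from the Hadoop Configuration.
via_flag = build_job_conf({}, {"pig.noSplitCombination": "true"})

for label, conf in [("ignored", ignored), ("via_set", via_set), ("via_flag", via_flag)]:
    print(label, "combines splits:", splits_are_combined(conf))
```

In this model only the first case still combines splits, which matches what the question observed: set pig.noSplitCombination true in the script is silently ignored, while the other two routes both end up with pig.noSplitCombination set in the JobConf.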

Answered by zsxwing, Nov 01 '22 13:11