I have a Pig Streaming job where the number of mappers should equal the number of rows/lines in the input file. I know that setting
set mapred.min.split.size 16
set mapred.max.split.size 16
set pig.noSplitCombination true
will ensure that each split is 16 bytes. But how do I ensure that each map task receives exactly one line as input? The lines are variable length, so using a constant value for mapred.min.split.size and mapred.max.split.size is not the best solution.
Here is the code I intend to use:
input = load 'hdfs://cluster/tmp/input';
DEFINE CMD `/usr/bin/python script.py`;
OP = stream input through CMD;
dump OP;
SOLVED! Thanks to zsxwing
And, in case anyone else runs into this weird nonsense, know this:
To ensure that Pig creates one mapper for each input file you must set
set pig.splitCombination false
and not
set pig.noSplitCombination true
Why this is the case, I have no idea!
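For reference, here is a minimal sketch of the script with the working setting applied. It assumes, as the wording above suggests, that the input path holds one small file per record, so that disabling split combination yields one mapper per file. The alias name lines is only an illustrative choice; the path and command are the ones from the question.

```
-- Disable split combination so Pig does not merge small inputs into one mapper.
-- Note: pig.splitCombination (not pig.noSplitCombination) is the key that works with set.
set pig.splitCombination false;

-- Assumption: this path contains one small file per record.
lines = load 'hdfs://cluster/tmp/input';
DEFINE CMD `/usr/bin/python script.py`;
OP = stream lines through CMD;
dump OP;
```

With the setting in place, the mapper count reported for the job should match the number of input files rather than one combined split.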
Following your clue, I browsed the Pig source code to find the answer.
Setting pig.noSplitCombination in the Pig script doesn't work. In the Pig script, you need to use pig.splitCombination. Pig will then set pig.noSplitCombination in the JobConf according to the value of pig.splitCombination.
If you want to set pig.noSplitCombination directly, you need to use the command line. For example,
pig -Dpig.noSplitCombination=true -f foo.pig
The difference between the two approaches is where the value ends up: if you use the set instruction in the Pig script, it is stored in the Pig properties. If you use -D, it is stored in the Hadoop Configuration.
If you use set pig.noSplitCombination true, then (pig.noSplitCombination, true) is stored in the Pig properties. But when Pig initializes a JobConf, it looks up pig.splitCombination in the Pig properties, so your setting has no effect. Here is the source code. The correct way is set pig.splitCombination false, as you mentioned.
If you use -Dpig.noSplitCombination=true, (pig.noSplitCombination, true) is stored in the Hadoop Configuration. Since the JobConf is copied from the Configuration, the value given with -D is passed directly to the JobConf.
Finally, PigInputFormat reads pig.noSplitCombination from the JobConf to decide whether to combine splits. Here is the source code.