
Finding mean using Pig or Hadoop

I have a huge text file of the following form.

The data is saved in the directory data/ as data1.txt, data2.txt, and so on:

merchant_id, user_id, amount
1234, 9123, 299.2
1233, 9199, 203.2
1234, 0124, 230
and so on...

What I want to do is, for each merchant, find the average amount.

So basically, in the end I want to save the output to a file, something like:

merchant_id, average_amount
1234, avg_amt_1234
and so on.

How do I calculate the standard deviation as well?

Sorry for asking such a basic question. :( Any help would be appreciated. :)

asked Sep 26 '12 by frazman


People also ask

What is Pig in Hadoop used for?

Pig is a high-level platform or tool which is used to process large datasets. It provides a high level of abstraction for processing on top of MapReduce, along with a high-level scripting language, known as Pig Latin, which is used to develop data analysis code.

How do you find the average of a Pig?

The Pig Latin AVG() function is used to compute the average of the numerical values within a bag. While calculating the average, the AVG() function ignores NULL values. To get a global average, we need to perform a GROUP ALL operation and then calculate the average using the AVG() function.
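
For illustration, here is a minimal sketch of that pattern, assuming a clean comma-separated file with the schema from the question (the file path and field names are assumptions, and the header row is assumed to have been removed):

txns = load 'data/data1.txt' using PigStorage(',') as (merchant_id:chararray, user_id:chararray, amount:double);
-- GROUP ALL puts every record into a single group so AVG can run over the whole relation
all_txns = group txns all;
global_avg = foreach all_txns generate AVG(txns.amount) as avg_amount;
dump global_avg;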

What is an advantage of Pig over SQL?

Pig uses a language called Pig Latin, which is similar to SQL. This language does not require as much code in order to analyze data. Pig is a high-level scripting platform for creating code that runs on Hadoop. Pig makes it easier to analyze, process, and clean big data without writing vanilla MapReduce jobs in Hadoop.


1 Answer

Apache Pig is well suited to such tasks. See the example below:

-- load the input; amnt is declared as double, which determines the SUM implementation used
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
-- group the rows by id
grp = group inpt by id;
-- compute the sum, count and mean for each group
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate group as id, sum/count as mean, sum as sum, count as count;
};
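
To actually save the per-merchant results to a file, as the question asks, a store statement along these lines should work (the output path here is just an assumption):

store mean into 'output/merchant_avg' using PigStorage(',');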

Pay special attention to the data type of the amnt column, as it determines which implementation of the SUM function Pig will invoke.
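
For illustration only, using the same file as above, a different declared type would select a different SUM overload (this is just a sketch of the point, not part of the solution):

as_double = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray); -- SUM returns a double
as_long   = load '~/pig_data/pig_fun/input/group.txt' as (amnt:long, id:chararray, c2:chararray);   -- SUM returns a long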

Pig can also do something that SQL cannot: it can put the mean against each input row without using any inner joins. That is useful if you are calculating z-scores using the standard deviation.

-- same aggregation, but FLATTEN keeps every original row next to its group's mean, sum and count
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};

FLATTEN(inpt) does the trick: now you have access to the original amount that contributed to the group's average, sum, and count.
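
For illustration, describing that relation should show roughly the following schema (a sketch; the exact qualified field names can differ between Pig versions):

describe mean;
-- mean: {inpt::amnt: double, inpt::id: chararray, inpt::c2: chararray, mean: double, sum: double, count: long}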

UPDATE 1:

Calculating variance and standard deviation:

-- compute each group's average while keeping every original row via flatten
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate flatten(inpt), sum/count as avg, count as count;
};
-- squared difference of each amount from its group's average
tmp = foreach mean {
    dif = (amnt - avg) * (amnt - avg);
    generate *, dif as dif;
};
-- regroup to sum the squared differences per id
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum;
-- variance is the mean squared difference; the standard deviation is its square root
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;
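
In other words, the script computes the population variance and standard deviation for each group:

variance = SUM((amnt - avg)^2) / count
standard = SQRT(variance)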

It will use 2 MapReduce jobs. I have not figured out how to do it in one yet; I need to spend more time on it.

answered Sep 22 '22 by alexeipab