
Finding mean using Pig or Hadoop

I have a huge text file of the following form.

The data is saved in the directory data/ as data1.txt, data2.txt, and so on:

merchant_id, user_id, amount
1234, 9123, 299.2
1233, 9199, 203.2
1234, 0124, 230
and so on...

What I want to do is, for each merchant, find the average amount.

So basically, in the end I want to save the output to a file, something like:

merchant_id, average_amount
1234, avg_amt_1234
and so on.

How do I calculate the standard deviation as well?

Sorry for asking such a basic question. :( Any help would be appreciated. :)

asked Sep 26 '12 by frazman


People also ask

What is Pig in Hadoop used for?

Pig is a high-level platform or tool which is used to process large datasets. It provides a high level of abstraction for processing on top of MapReduce, along with a high-level scripting language, known as Pig Latin, which is used to develop data analysis code.

How do you find the average of a Pig?

The Pig Latin AVG() function is used to compute the average of the numerical values within a bag. While calculating the average, the AVG() function ignores NULL values. To get a global average, we need to perform a GROUP ALL operation and then calculate the average using the AVG() function.
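
For illustration, here is a minimal sketch of that pattern, assuming a clean comma-separated file with the schema from the question (the file path and field names are assumptions, and the header row is assumed to have been removed):

txns = load 'data/data1.txt' using PigStorage(',') as (merchant_id:chararray, user_id:chararray, amount:double);
-- GROUP ALL puts every record into a single group so AVG can run over the whole relation
all_txns = group txns all;
global_avg = foreach all_txns generate AVG(txns.amount) as avg_amount;
dump global_avg;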

What is an advantage of Pig over SQL?

Pig uses a language called Pig Latin, which is similar to SQL. This language does not require as much code in order to analyze data. Pig is a high-level scripting platform for creating code that runs on Hadoop. Pig makes it easier to analyze, process, and clean big data without writing vanilla MapReduce jobs in Hadoop.


1 Answer

Apache Pig is well suited to such tasks. See the example below:

-- load the input; amnt is declared as double, which determines the SUM implementation used
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
-- group the rows by id
grp = group inpt by id;
-- compute the sum, count and mean for each group
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate group as id, sum/count as mean, sum as sum, count as count;
};
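
To actually save the per-merchant results to a file, as the question asks, a store statement along these lines should work (the output path here is just an assumption):

store mean into 'output/merchant_avg' using PigStorage(',');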

Pay special attention to the data type of the amnt column, as it determines which implementation of the SUM function Pig will invoke.
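
For illustration only, using the same file as above, a different declared type would select a different SUM overload (this is just a sketch of the point, not part of the solution):

as_double = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray); -- SUM returns a double
as_long   = load '~/pig_data/pig_fun/input/group.txt' as (amnt:long, id:chararray, c2:chararray);   -- SUM returns a long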

Pig can also do something that SQL cannot: it can put the mean against each input row without using any inner joins. That is useful if you are calculating z-scores using the standard deviation.

-- same aggregation, but FLATTEN keeps every original row next to its group's mean, sum and count
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};

FLATTEN(inpt) does the trick: now you have access to the original amount that contributed to the group's average, sum, and count.
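
For illustration, describing that relation should show roughly the following schema (a sketch; the exact qualified field names can differ between Pig versions):

describe mean;
-- mean: {inpt::amnt: double, inpt::id: chararray, inpt::c2: chararray, mean: double, sum: double, count: long}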

UPDATE 1:

Calculating variance and standard deviation:

-- compute each group's average while keeping every original row via flatten
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
    sum = SUM(inpt.amnt);
    count = COUNT(inpt);
    generate flatten(inpt), sum/count as avg, count as count;
};
-- squared difference of each amount from its group's average
tmp = foreach mean {
    dif = (amnt - avg) * (amnt - avg);
    generate *, dif as dif;
};
-- regroup to sum the squared differences per id
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum;
-- variance is the mean squared difference; the standard deviation is its square root
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;
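
In other words, the script computes the population variance and standard deviation for each group:

variance = SUM((amnt - avg)^2) / count
standard = SQRT(variance)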

It will use 2 MapReduce jobs. I have not figured out how to do it in one yet; I need to spend more time on it.

answered Sep 22 '22 by alexeipab