Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to sum values of duplicate rows with awk?

Tags:

awk

I have a CSV file with a heading row and multiple data rows each with 11 data columns like this:

Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205583995140400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205583995140400,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal  Address
05 Jun 2018,[email protected],205486016644400,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal  Address
05 Jun 2018,[email protected],205588935534900,,2,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,[email protected],205588935534900,,1,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address

I want to remove the duplicates in that file and sum the values in the Quantity data column. I want the result to be like this:

Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205583995140400,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Syahrul Address
05 Jun 2018,[email protected],205486016644400,,4,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Faishal  Address
05 Jun 2018,[email protected],205588935534900,,3,Gold,05 Jun 2018 – 10:01,In Process,Rp3.000.000,Done,Rutwan Address

I want to sum only the values in the fifth data column Quantity while leaving the rest as it is. I have tried the solution in Sum duplicate row values with awk, but the answer there works only if the file has only two data columns. My CSV file has 11 data columns and so it doesn't work.

How to do it with awk?

like image 845
Joe Avatar asked Dec 24 '22 06:12

Joe


2 Answers

awk to the rescue!

$ awk 'BEGIN{FS=OFS=","} 
       NR==1{print; next} 
            {q=$5; $5="~"; a[$0]+=q} 
       END  {for(k in a) {sub("~",a[k],k); print k}}' file

Order Date,Username,Order Number,No Resi,Quantity,Title,Update Date,Status,Price Per Item,Status Tracking,Alamat
05 Jun 2018,[email protected],205588935534900,,3,Gold,05 Jun 2018 - 10:01,In Process,Rp3.000.000,Done,Rutwan Address
05 Jun 2018,[email protected],205486016644400,,4,Gold,05 Jun 2018 - 10:01,In Process,Rp3.000.000,Done,Faishal  Address
05 Jun 2018,[email protected],205583995140400,,3,Gold,05 Jun 2018 - 10:01,In Process,Rp3.000.000,Done,Syahrul Address

note that the order of records are not guaranteed, but also doesn't require them to be sorted initially. To preserve the order there are multiple solutions...

Also, I use ~ as a placeholder. If your data includes this char you can replace with an unused one.

UPDATE

To preserve the order (based on first appearance of a row)

$ awk 'BEGIN{FS=OFS=","} 
       NR==1{print; next} 
            {q=$5;$5="~"; if(!($0 in a)) b[++c]=$0; a[$0]+=q} 
       END  {for(k=1;k<=c;k++) {sub("~",a[b[k]],b[k]); print b[k]}}' file

keep a separate structure to mark the order of the rows and iterate over that data structure...

like image 58
karakfa Avatar answered Jan 25 '23 13:01

karakfa


Taking direct adaption from Karafka's solution and adding some code in it a bit to get the lines in proper order(in which they are present in Input_file) as per OP's request.

awk -F, '
FNR==1{
  print;
  next}
{
  val=$5;
  $5="~";
  a[$0]+=val
}
!b[$0]++{
  c[++count]=$0}
END{
  for(i=1;i<=count;i++){
     sub("~",a[c[i]],c[i]);
     print c[i]}
}' OFS=,   Input_file

Explanation: Adding explanation to above code too now.

awk -F, '                         ##Setting field separator as comma here.
FNR==1{                           ##Checking condition if line number is 1 then do following.
  print;                          ##Print the current line.
  next}                           ##next will skip all further statements from here.
{
  val=$5;                         ##Creating a variable named val whose value is 5th field of current line.
  $5="~";                         ##Setting value of 5th field as ~ here to keep all lines same(to create index for array a).
  a[$0]+=val                      ##Creating an array named a whose index is current line and its value is variable val value.
}
!b[$0]++{                         ##Checking if array b whose index is current line its value is NULL then do following.
  c[++count]=$0}                  ##Creating an array named c whose index is variable count increasing value with 1 and value is current line.
END{                              ##Starting END block of awk code here.
  for(i=1;i<=count;i++){          ##Starting a for loop whose value starts from 1 to till value of count variable.
     sub("~",a[c[i]],c[i]);       ##Substituting ~ in value of array c(which is actually lines value) with value of SUMMED $5.
     print c[i]}                  ##Printing newly value of array c where $5 is now replaced with its actual value.
}' OFS=, Input_file               ##Setting OFS as comma here and mentioning Input_file name here too.
like image 29
RavinderSingh13 Avatar answered Jan 25 '23 13:01

RavinderSingh13