In bash how to transform multimap<K,V> to a map of <K, {V1,V2}>

Question

I am processing output from a file in bash and need to group values by their keys.

For example, I have key,value pairs, one per line, in a file and want to group all values for a particular key into a single line, as in:

13,47099,54024,1,39956,0
17,126223,52782,4,62617,0
23,1022724,79958,80590,230,1,118224,0,1049
42,72470,80185,2,89199,0
54,70344,72824,1,62969,1

There are about 10,000 entries in my input file. How do I transform this data in the shell?

karakfa · Accepted Answer

awk to the rescue!

Assuming keys are contiguous, i.e. all lines for a given key are adjacent in the file:

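$ awk -F, 'p!=$1 {if(a) print a; a=p=$1}   # key changed: print the finished group, start a new one
                 {a=a FS $2}                # every line: append the value to the current group
           END   {print a}                  # after the last line: print the final group
          ' file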

13,47099,54024,1,39956,0
17,126223,52782,4,62617,0
23,1022724,79958,80590,230,1,118224,0,1049
42,72470,80185,2,89199,0
54,70344,72824,1,62969,1
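
If the lines for a key are not adjacent, the one-pass program above prints a separate group each time the key reappears. Here is a minimal sketch of one alternative, assuming the same key,value layout: collect the values in an awk array indexed by key and print the groups in order of first appearance. The function name group_by_key and the sample input are made up for illustration.

```shell
# group_by_key is a hypothetical helper name; it reads key,value lines on stdin.
group_by_key() {
    awk -F, '
        !($1 in vals) { order[++n] = $1 }     # first time this key is seen: remember its position
        { vals[$1] = vals[$1] FS $2 }         # append the separator and the value to this key
        END { for (i = 1; i <= n; i++) print order[i] vals[order[i]] }
    '
}

# Made-up sample with interleaved keys.
printf '%s\n' 13,47099 17,126223 13,54024 17,52782 | group_by_key
```

This holds every group in memory until the end of input, which should be fine for about 10,000 lines. The one-pass version is the better fit when the keys are already adjacent.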

Josh · Answer

Here is a breakdown of what @karakfa's code is doing, for us awk beginners. I've written this based on a toy dataset file:

1,X
1,Y
3,Z

p!=$1: check whether the pattern p!=$1 is true
- compares variable p with the first field of the current (first) line of file (1 in this case)
- since p is undefined at this point it cannot be equal to 1, so p!=$1 is true and we continue with this line of code
if(a) print a: check whether variable a has a (non-empty) value and print a if it does
- since a is undefined at this point, the print a command is not executed
a=p=$1: set variables a and p equal to the value of the first field of the current (first) line (1 in this case)
a=a FS $2: set variable a equal to a combined with the value of the second field of the current (first) line, separated by the field separator (1,X in this case)
END: since we haven't reached the end of file yet, we skip the rest of this line of code
move to the next (second) line of file and restart the awk code on that line
p!=$1: check whether the pattern p!=$1 is true
- since p is 1 and the first field of the current (second) line is 1, p!=$1 is false and we skip the rest of this line of code
a=a FS $2: set a equal to the value of a and the value of the second field of the current (second) line, separated by the field separator (1,X,Y in this case)
END: since we haven't reached the end of file yet, we skip the rest of this line of code
move to the next (third) line of file and restart the awk code
p!=$1: check whether the pattern p!=$1 is true
- since p is 1 and $1 of the third line is 3, p!=$1 is true and we continue with this line of code
if(a) print a: check whether variable a has a (non-empty) value and print a if it does
- since a is 1,X,Y at this point, 1,X,Y is printed to the output
a=p=$1: set variables a and p equal to the value of the first field of the current (third) line (3 in this case)
a=a FS $2: set variable a equal to a combined with the value of the second field of the current (third) line, separated by the field separator (3,Z in this case)
END {print a}: since we have reached the end of file, execute this code
- print a: print the last group a (3,Z in this case)

The resulting output is

1,X,Y
3,Z
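
To check the walkthrough, the toy dataset can be fed to the same awk program directly. This is only a small sketch: the function name group_adjacent is made up here, and printf stands in for the toy file.

```shell
# group_adjacent is a hypothetical wrapper around the awk program from the accepted answer.
group_adjacent() {
    awk -F, 'p!=$1 {if(a) print a; a=p=$1}
                   {a=a FS $2}
             END   {print a}'
}

# The three lines of the toy dataset, piped in instead of read from a file.
printf '1,X\n1,Y\n3,Z\n' | group_adjacent
```

Running it should print the same two lines, 1,X,Y and 3,Z.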

Please let me know if there are any errors in this description.

Tags: bash, mapreduce

Anoop
