Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using awk to sort fields and arrange

Tags:

awk

Im trying to learn awk at the moment and I want to do a specific task. My question is similar in scope to one previously posted(Using awk to transpose column to row), but wouldn't quite work for my data. I have been trying to work out why and im sure its quite simple.

I have large data in a tab delimited table with only two fields (example below):

1101\t7778
1101\t7755
1101\t8889
1101\t6789
2300\t1220
4000\t2333
4000\t7555
4000\t9000
4000\t1111

and I want to end up appending the second field onto a row when the field matches. The desired output would be:

1101\t7778\t7755\t8889\t6789
2300\t1220
4000\t2333\t7555\t9000\t1111

If possible, Id like to get an explaination of all the parts within the command so I can understand it in the future. Thanks in advance.

like image 742
gammyknee Avatar asked Dec 15 '22 10:12

gammyknee


2 Answers

awk '    { list[$1] = list[$1] "\t" $2 }
     END { for (i in list) printf "%s%s\n", i, list[i] }' data

The first line adds a tab and the second field to the list element indexed by $1. The second line prints out the key and the accumulated list of values.

Sample output:

1101    7778    7755    8889    6789
4000    2333    7555    9000    1111
2300    1220

If you want the first column sorted, you can pipe the output through sort -n. If you have GNU awk, you can investigate the built-in sort function too:

/usr/gnu/bin/awk '    { list[$1] = list[$1] "\t" $2 }
                  END { n = asorti(list, indexes);
                        for (i = 1; i <= n; i++)
                            printf "%s%s\n", indexes[i], list[indexes[i]]
                      }' data

Sorted output:

1101    7778    7755    8889    6789
2300    1220
4000    2333    7555    9000    1111
like image 109
Jonathan Leffler Avatar answered Jan 22 '23 11:01

Jonathan Leffler


For abasu's request a pure bash version:

#!/bin/bash

declare -A hash
while read x y; do
  hash[$x]=${hash[$x]}"\t"$y
done <<XXX
1101    7778
1101    7755
1101    8889
1101    6789
2300    1220
4000    2333
4000    7555
4000    9000
4000    1111
XXX

for i in ${!hash[*]}; { echo -e $i${hash[$i]};}

Output:

2300    1220
1101    7778    7755    8889    6789
4000    2333    7555    9000    1111

In the here-is-the-document there is a tab character between the columns, as well as in between the output columns. If the -e is removed from the last line after echo the output is:

2300\t1220
1101\t7778\t7755\t8889\t6789
4000\t2333\t7555\t9000\t1111
like image 20
TrueY Avatar answered Jan 22 '23 11:01

TrueY