BASH

Question

I have the following tab-separated file:

A1      A1      0       0       2       1       1 1     1 1     1 1     2 1     1 1
A2      A2      0       0       2       1       1 1     1 1     1 1     1 1     1 1
A3      A3      0       0       2       2       1 1     2 2     1 1     1 1     1 1
A5      A5      0       0       2       2       1 1     1 1     1 1     1 2     1 1

The idea is to summarise the information between column 7 (included) and the end in a new column that is added at the end of the file.

To do so, these are the rules:

If the total number of “2”s in the row (between column 7 and the end) is 0: add “1 1” to the new last column
If the total number of “2”s in the row (between column 7 and the end) is 1: add “1 2” to the new last column
If the total number of “2”s in the row (between column 7 and the end) is 2 or more: add “2 2” to the new last column

I started by extracting the columns I want to work on with this command:

awk '{for (i = 7; i <= NF; i++) printf $i " "; print ""}' myfile.ped > tmp_myfile.txt

Then I count the number of occurrences of "2" in each row using:

sed 's/[^2]//g' tmp_myfile.txt | awk '{print NR, length }' > tmp_occurences.txt

Which outputs (row number, then the number of 2s):

1 1
2 0
3 2
4 1

My idea was then to write a loop that goes through the lines and adds the new summary column. I was thinking of this kind of structure, based on what I found at http://www.thegeekstuff.com/2010/06/bash-if-statement-examples:

while read line ;
do
    set $line

    If ["$2"==0]
    then
        $3=="1 1"

    elif ["$2"==1 ]
    then
        $3=="1 2”

    elif ["$2">=2 ]
    then 
        $3==“2 2”

    else
        print ["error"]

    fi
done < tmp_occurences.txt

But I am stuck here. Do I have to create the new column before starting the loop? Am I going in the right direction?
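
For reference, here is a minimal sketch of the loop above written in valid shell syntax. It assumes each input line has the form "row-number count", as produced by the sed/awk pipeline; the function name summarise_counts and the sample values are illustrative and not part of the original post. The main changes are that `[` needs spaces around it, integer comparisons use -eq and -ge, and an assignment is written name=value with no leading $.

```shell
# Map each "<row> <count>" line on stdin to its summary code.
# summarise_counts is a hypothetical helper name.
summarise_counts() {
    while read -r row count; do
        if [ "$count" -eq 0 ]; then
            summary="1 1"
        elif [ "$count" -eq 1 ]; then
            summary="1 2"
        elif [ "$count" -ge 2 ]; then
            summary="2 2"
        else
            # Reached for negative or non-numeric counts
            echo "error: unexpected count on row $row" >&2
            continue
        fi
        printf '%s\n' "$summary"
    done
}

# Sample input in the format of tmp_occurences.txt (assumed values)
printf '%s\n' '1 1' '2 0' '3 2' '4 1' | summarise_counts
```

There is no need to create the column beforehand: the loop only has to print one summary per row, and that output can then be joined to the first six columns, for example with cut and paste.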

Ideally, the final output (after merging the first 6 columns from the initial file and the summary column) would be:

A1      A1      0       0       2       1       1 2
A2      A2      0       0       2       1       1 1
A3      A3      0       0       2       2       2 2
A5      A5      0       0       2       2       1 2

Thank you for your help!

anubhava · Accepted Answer

Using gnu-awk you can do:

awk -v OFS='\t' '{
   c=0;
   for (i=7; i<=NF; i++)
      if ($i==2)
         c++
   if (c==0)
      s="1 1"
   else if (c==1)
      s="1 2"
   else
      s="2 2"
   NF=6
   print $0, s
}' file

A1  A1  0   0   2   1   1 2
A2  A2  0   0   2   1   1 1
A3  A3  0   0   2   2   2 2
A5  A5  0   0   2   2   1 2

PS: If not using gnu-awk you can use:

awk -v OFS='\t' '{c=0; for (i=7; i<=NF; i++) {if ($i==2) c++; $i=""} if (c==0) s="1 1"; else if (c==1) s="1 2"; else s="2 2"; NF=6; print $0, s}' file
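
If assigning to NF is a concern on a particular awk (some older implementations may not rebuild the record when NF changes), one alternative is to print the first six fields explicitly and never touch NF. This is only a sketch under that assumption; the wrapper name summarise_ped is made up for illustration.

```shell
# Print columns 1-6 plus a summary column, without modifying NF.
# summarise_ped is a hypothetical wrapper; it reads files given as
# arguments, or stdin when there are none.
summarise_ped() {
    awk '{
        c = 0
        # Default field splitting: tabs and spaces both separate fields
        for (i = 7; i <= NF; i++)
            if ($i == 2) c++
        s = (c == 0) ? "1 1" : ((c == 1) ? "1 2" : "2 2")
        # Emit the first six fields explicitly, tab-separated
        printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n", $1, $2, $3, $4, $5, $6, s
    }' "$@"
}

# First row of the sample file from the question
printf 'A1\tA1\t0\t0\t2\t1\t1 1\t1 1\t1 1\t2 1\t1 1\n' | summarise_ped
```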

Ed Morton · Answer

With GNU awk for the 3rd arg to match():

$ awk '{match($0,/((\S+\s+){6})(.*)/,a); c=gsub(2,2,a[3]); print a[1] (c>1?2:1), (c>0?2:1)}' file
A1      A1      0       0       2       1       1 2
A2      A2      0       0       2       1       1 1
A3      A3      0       0       2       2       2 2
A5      A5      0       0       2       2       1 2

With other awks you'd replace \S/\s with [^[:space:]]/[[:space:]] and use substr() instead of a[].
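
A rough sketch of what that could look like, using the two-argument match() with RLENGTH and substr(). The regexp for the six leading fields is built by concatenation rather than with {6}, on the assumption that not every awk supports interval expressions; the wrapper name summarise_ped_match is invented for this example.

```shell
# Split each line into "first six fields" and "the rest" with match()
# and substr(), then count the 2s in the rest.
# summarise_ped_match is a hypothetical wrapper name.
summarise_ped_match() {
    awk 'BEGIN {
        f  = "[^[:space:]]+[[:space:]]+"   # one field plus its trailing separator
        re = "^" f f f f f f               # six fields, written out instead of {6}
    }
    match($0, re) {                        # lines without 6 leading fields are skipped
        head = substr($0, 1, RLENGTH)      # columns 1-6, original separators kept
        rest = substr($0, RLENGTH + 1)     # column 7 to the end of the line
        c = gsub(/2/, "&", rest)           # gsub returns the number of 2s replaced
        first  = (c > 1) ? 2 : 1
        second = (c > 0) ? 2 : 1
        print head first " " second
    }' "$@"
}

# Fourth row of the sample file from the question
printf 'A5\tA5\t0\t0\t2\t2\t1 1\t1 1\t1 1\t1 2\t1 1\n' | summarise_ped_match
```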

BASH - Summarising information from several fields in unique field using Loop and If statements

Tags: loops, if-statement, multiple-columns, awk

Svalf

