gsub many columns simultaneously based on different gsub conditions?

Question

I have a file with the following data-

Input-

A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B

In every row, any field that holds the same letter as the same column of row 1 should be changed to 1. Basically, I'm trying to find out how similar each row is to the first row.

Desired Output-

1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B

The first row has become all 1 since it is identical to itself (obviously). In the second row, the first and second columns are identical to the first row (A B) and hence they become 1 1. And so on for the other rows.

I have written the following code which does this transformation-

for seq in {1..1} ; #Iterate over the rows (in this case just row 1)
do 
    for position in {1..6} ; #Iterate over the columns
    do 
        #Define the letter in the first row with which I'm comparing the rest of the rows
        aa=$(awk -v pos=$position -v line=$seq 'NR == line {print $pos}' f) 
        #If it matches, gsub it to 1 
        awk -v var=$aa -v pos=$position '{gsub (var, "1", $pos)} 1' f > temp
        #Save this intermediate file and now act on this
        mv temp f 
    done 
done

As you can imagine, this is really slow because that nested loop is expensive. My real data is a 60x10000 matrix and it takes about 2 hours for this program to run on that.

I was hoping you could help me get rid of the inner loop so that I can do all 6 gsubs in a single step. Maybe putting them in an array of their own? My awk skills aren't that great yet.

anubhava · Accepted Answer

You can use this simpler awk command to do the job. It will be much faster because it avoids the nested shell loops and the repeated awk invocations inside them:

awk '{for (i=1; i<=NF; i++) {if (NR==1) a[i]=$i; if (a[i]==$i) $i=1} } 1' file

1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
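For readers newer to awk, the one-liner above can be spelled out as follows. This is only an expanded sketch of the same logic; the function name mark_matches and the array name ref are made up for illustration, and the input is fed on stdin instead of being read from a file.

```shell
# Hypothetical wrapper: reads rows on stdin, writes the marked rows to stdout.
mark_matches() {
    awk '
    {
        for (i = 1; i <= NF; i++) {
            if (NR == 1) ref[i] = $i      # remember the first row, column by column
            if ($i == ref[i]) $i = 1      # same value in the same column: replace with 1
        }
        print                             # same effect as the trailing "1" pattern
    }'
}

# Small demo on the first three rows of the sample data.
printf '%s\n' 'A B C D E F' 'A B B B B B' 'C A C D E F' | mark_matches
```

The whole file is read once by a single awk process, which is where the speed-up over the nested shell loops comes from.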

EDIT:

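As per the comments below, here is how to also print, for each row, the number of columns that match the first row. Letters count as 0 when awk adds them up, so only the 1s contribute to the sum: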

awk '{sum=0; for (i=1; i<=NF; i++) { if (NR==1) a[i]=$i; if (a[i]==$i) $i=1; sum+=$i}
      print $0, sum}' file

1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3
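The sum above works because non-numeric fields such as B evaluate to 0. If the real matrix could contain numbers, an explicit counter avoids adding unmatched numeric fields to the total. A minimal sketch under that assumption (the names count_matches, ref and n are hypothetical and not part of the original answer):

```shell
# Hypothetical wrapper: marks matches and appends the match count to each row.
count_matches() {
    awk '
    NR == 1 { for (i = 1; i <= NF; i++) ref[i] = $i }   # store the reference row
    {
        n = 0
        for (i = 1; i <= NF; i++)
            if ($i == ref[i]) { $i = 1; n++ }           # count only real matches
        print $0, n
    }'
}

# Demo with a numeric column that does not match in the second row.
printf '%s\n' 'A 7' 'A 9' | count_matches
```

With this input, summing the fields would add the unmatched 9 to the second row's total, while the counter reports one match.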

Akshay Hegde · Answer

Input

$ cat f
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B

Desired output

$ awk 'FNR==1{split($0,a)}{for(i=1;i<=NF;i++)if (a[i]==$i) $i=1}1' f
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B

Explanation

FNR==1{ .. }

When awk reads the first record of the current file, it runs the statements inside the braces.

split(string, array [, fieldsep [, seps ] ])

Divide string into pieces separated by fieldsep and store the pieces in array and the separator strings in the seps array.

split($0,a)

Split the current record or row ($0) into pieces by fieldsep (the default is FS, which is whitespace here, as we have not supplied a 3rd argument) and store the pieces in array a. So array a contains the data from the first row:

       a[1] = A 
       a[2] = B
       a[3] = C 
       a[4] = D  
       a[5] = E  
       a[6] = F
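To see this for yourself, the following sketch prints what split() stores. The function name show_split is made up for illustration; split() returns the number of pieces, which is kept in n here.

```shell
# Hypothetical helper: split each input line on the default separator and list the pieces.
show_split() {
    awk '{
        n = split($0, a)                                  # n is the number of pieces
        for (i = 1; i <= n; i++) print "a[" i "] = " a[i]
    }'
}

printf 'A B C D E F\n' | show_split
```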

for(i=1;i<=NF;i++)

Loop through all the fields of each record, for every record until the end of the file.

if (a[i]==$i) $i=1

If the first row's value in the current column (index i) is equal to the current row's value in that column, set the current field to 1 (that is, modify the current field).

Now that the fields have been modified, all that is left is to print the modified row.

}1

1 always evaluates to true, so awk performs the default action, {print $0}.
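One detail worth knowing before running this on real data: assigning to a field makes awk rebuild $0 using the output field separator OFS (a single space by default), so the printed row is space-separated even if the input used tabs. A minimal sketch, assuming tab-separated input (that assumption and the function name mark_matches_tsv are not from the question):

```shell
# Hypothetical variant for tab-separated data: keep tabs on input and output.
mark_matches_tsv() {
    awk 'BEGIN { FS = OFS = "\t" }                         # split and rejoin on tabs
         FNR == 1 { split($0, a) }                         # split() uses FS, so tabs here too
         { for (i = 1; i <= NF; i++) if (a[i] == $i) $i = 1 }
         1'
}

printf 'A\tB\tC\nA\tX\tC\n' | mark_matches_tsv
```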

Update in response to a request in the comments

Same question here, I have a second part of the program that adds up the numbers in the rows. I.e. You would get 6, 2, 4, 2, 2, 3 for this output. Can your program be tweaked to get these values out at this step itself?

$ awk 'FNR==1{split($0,a)}{s=0;for(i=1;i<=NF;i++)if(a[i]==$i)s+=$i=1;print $0,s}' f
1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3

Tags: bash, loops, awk, gsub

Chem-man17
