I have a tab-delimited file with at least 16 columns (possibly more), where the first column is a unique identifier, and more than 10,000 rows (only a 6x6 sample is shown here), like this:
ID VAR1 VAR2 VAR3 VAR4 VAR5
1 1 1 1 1 1
2 -9 -9 -9 -9 -9
3 3 3 3 3 3
4 4 4 4 -9 4
5 5 5 5 5 5
6 6 -9 6 6 6
I need to change all of the VAR1-5 values in a row to "-9" if any one of them is already "-9".
So, the desired output would be:
ID VAR1 VAR2 VAR3 VAR4 VAR5
1 1 1 1 1 1
2 -9 -9 -9 -9 -9
3 3 3 3 3 3
4 -9 -9 -9 -9 -9
5 5 5 5 5 5
6 -9 -9 -9 -9 -9
So far, I've tried to do this in awk:
awk -F'\t' '
BEGIN{OFS="\t"}
{for(i=2;i<=NF;i++){if ($i=="-9"){for(j=2;j<=NF;j++){$j="-9"};continue}}};1
' < file1.tab
Which works, but is very slow when applied to the actual dataset. Is there a faster way to do this? Perhaps something with a combination of grep and sed?
Here's an awk variation that doesn't hard-code the number of columns.
awk -F '\t' '/(^|\t)-9(\t|$)/ {
    # Print the ID, then "-9" for every remaining field, without touching $0
    printf "%s", $1; for (i = 2; i <= NF; ++i) printf "\t-9"; printf "\n"
    next }
1' file1.tab
The main optimization here is that Awk scans the entire line at once and triggers on the regex immediately, without needing to loop over the fields unless it already knows there is a match. The (^|\t) and (\t|$) anchors ensure that -9 only matches a whole field, so a value like -99 does not trip it.
Because we know we will ditch all the fields except the first, there is no need to have Awk replace the fields so that it can then print them. Just generate the output we want to print and move on without touching Awk's internal representation of the line. This should buy a couple of cycles as well, though this is a very minor performance improvement.
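For comparison, the field-assignment version of the same idea, which the previous paragraph avoids, would look roughly like this; it keeps the regex short-circuit but makes Awk rebuild the record from OFS on every matching line (a sketch, just to illustrate the difference):

awk -F '\t' 'BEGIN { OFS = FS } /(^|\t)-9(\t|$)/ { for (i = 2; i <= NF; ++i) $i = "-9" } 1' file1.tab

As noted above, the gap between the two is minor; the main saving in either case is that rows without a -9 are printed without ever looping over their fields.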