
How to add an integer to a difference calculation and print it to the end of a line?

Tags: sed, awk

Goal: To print the absolute difference between two colon-separated fields ($3 and $2), plus an integer (+1), at the end of each line beginning with ">".

Representative sample of my file:

>lcl|ORF1_      17609   17804   (+):21:131 unnamed protein product
MEKVKNKFDENDIKVPFVPSSLLFNNTGNLNTMDKR
>lcl|ORF2_      17609   17804   (+):70:111 unnamed protein product
MFLLHYYLIIQVI
>lcl|ORF3_      17609   17804   (+):112:147 unnamed protein product
MQWIKDKVLIK
>lcl|ORF4_      17609   17804   (+):129:91 unnamed protein product
MFYPLYLDYLYY
>lcl|ORF5_      17609   17804   (+):90:1 unnamed protein product, partial
MIMKKEQMELLYHSHQIYFLPFPLHQNIHP

Desired Output:

>lcl|ORF1_      17609   17804   (+):21:131 unnamed protein product:111
MEKVKNKFDENDIKVPFVPSSLLFNNTGNLNTMDKR
>lcl|ORF2_      17609   17804   (+):70:111 unnamed protein product:42
MFLLHYYLIIQVI
>lcl|ORF3_      17609   17804   (+):112:147 unnamed protein product:36
MQWIKDKVLIK
>lcl|ORF4_      17609   17804   (+):129:91 unnamed protein product:39
MFYPLYLDYLYY
>lcl|ORF5_      17609   17804   (+):90:1 unnamed protein product, partial:90
MIMKKEQMELLYHSHQIYFLPFPLHQNIHP

My current awk script gets me very close: it prints the difference between $3 and $2 at the end of each line. However, it does not include the required +1 step, and it is not limited to lines beginning with ">" despite my attempt with /^ *>/ (not required, but nice to have):

$ awk -F":" 'BEGIN {OFS=FS} /^ *>/ {$4=$3-$2} $4<0 {$4=-$4} 1' file

>lcl|ORF1_      17609   17804   (+):21:131 unnamed protein product:110
MEKVKNKFDENDIKVPFVPSSLLFNNTGNLNTMDKR:::0
>lcl|ORF2_      17609   17804   (+):70:111 unnamed protein product:41
MFLLHYYLIIQVI:::0
>lcl|ORF3_      17609   17804   (+):112:147 unnamed protein product:35
MQWIKDKVLIK:::0
>lcl|ORF4_      17609   17804   (+):129:91 unnamed protein product:38
MFYPLYLDYLYY:::0
>lcl|ORF5_      17609   17804   (+):90:1 unnamed protein product, partial:89
MIMKKEQMELLYHSHQIYFLPFPLHQNIHP:::0

Attempts to add the integer (+1) to the difference calculation:

$ awk -F":" 'BEGIN {OFS=FS} /^ *>/ {$4+1=$3-$2} $4<0 {$4=-$4} 1' file
awk: line 1: syntax error at or near =

$ awk -F":" 'BEGIN {OFS=FS} /^ *>/ {$4+=1=$3-$2} $4<0 {$4=-$4} 1' file
awk: line 1: syntax error at or near =

$ awk -F":" -v n=1 'BEGIN {OFS=FS} /^ *>/ {$4+n=$3-$2} $4<0 {$4=-$4} 1' file
awk: line 1: syntax error at or near =

And although I'm not sure how to implement functions using awk, I think there could be some utility in using something similar to this:

$ function add_one (number) {
      return number + 1
  }
$ awk -F":" 'BEGIN {OFS=FS} /^ *>/ {add_one($4)=$3-$2} $4<0 {$4=-$4} 1' file

While I have been attempting to use awk to solve this problem, I am interested in any solution (e.g., since I am attempting to perform this calculation line-by-line, perhaps there is a more efficient solution with sed?).

Asked by Gawain on Apr 10 '21 at 04:04



1 Answer

Here is an alternative awk solution that should work on all awk versions:

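awk 'BEGIN {FS=OFS=":"} /^>/ {
   v3 = $3 + 0                             # numeric prefix of $3, e.g. 131
   diff = 1 + (v3 > $2 ? v3-$2 : $2-v3)    # absolute difference plus 1
   $0 = $0 OFS diff                        # append the result to the header line
} 1' file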

>lcl|ORF1_      17609   17804   (+):21:131 unnamed protein product:111
MEKVKNKFDENDIKVPFVPSSLLFNNTGNLNTMDKR
>lcl|ORF2_      17609   17804   (+):70:111 unnamed protein product:42
MFLLHYYLIIQVI
>lcl|ORF3_      17609   17804   (+):112:147 unnamed protein product:36
MQWIKDKVLIK
>lcl|ORF4_      17609   17804   (+):129:91 unnamed protein product:39
MFYPLYLDYLYY
>lcl|ORF5_      17609   17804   (+):90:1 unnamed protein product, partial:90
MIMKKEQMELLYHSHQIYFLPFPLHQNIHP
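
The syntax errors in the question come from placing an expression such as `$4+1` on the left of `=`. awk only accepts a variable, field, or array element there, so the `+1` belongs on the right-hand side. As a rough sketch of the function-based idea from the question, the same result might be obtained as follows. The helper name `span` and the file name `sample.txt` are made up for illustration.

```shell
# Build a small sample file (hypothetical name) with two records from the question.
cat > sample.txt <<'EOF'
>lcl|ORF1_      17609   17804   (+):21:131 unnamed protein product
MEKVKNKFDENDIKVPFVPSSLLFNNTGNLNTMDKR
>lcl|ORF4_      17609   17804   (+):129:91 unnamed protein product
MFYPLYLDYLYY
EOF

# span() is an illustrative helper: absolute difference plus one.
result=$(awk -F: '
function span(a, b) { return (a > b ? a - b : b - a) + 1 }
# Adding 0 keeps only the leading number of "131 unnamed protein product".
/^>/ { $0 = $0 FS span($2 + 0, $3 + 0) }
1' sample.txt)

printf '%s\n' "$result"
```

With this sample, the two header lines should end in `:111` and `:39`, and the sequence lines should pass through unchanged because the rule only fires on lines matching `/^>/`.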

PS: Make sure to remove DOS line breaks from your input file before running this awk.
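
The note about DOS line breaks matters because, with CRLF line endings, the carriage return stays at the end of each record and the appended value would land after it. If converting the file beforehand is inconvenient, one option is to strip the carriage return inside awk itself. A minimal sketch, assuming a hypothetical file `dos_sample.txt` that is created here only for demonstration:

```shell
# Create a sample with DOS (CRLF) line endings; the file name is hypothetical.
printf '>lcl|ORF2_ (+):70:111 unnamed protein product\r\nMFLLHYYLIIQVI\r\n' > dos_sample.txt

cleaned=$(awk 'BEGIN {FS=OFS=":"}
{ sub(/\r$/, "") }          # drop a trailing carriage return, if any
/^>/ {
   v3 = $3 + 0
   diff = 1 + (v3 > $2 ? v3 - $2 : $2 - v3)
   $0 = $0 OFS diff
} 1' dos_sample.txt)

printf '%s\n' "$cleaned"
```

The `sub()` call runs on every line before the header rule, so both header and sequence lines come out with plain LF endings.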

Answered by anubhava on Sep 23 '22 at 14:09