Filtering text file based on values in some lines using awk

Question

I am working with a text file in which records are separated by blank lines. I want to extract the records that meet certain criteria.

For example, my text file looks like this

#EVM predictionEVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
11477   11043   single- 4   6   {SNAP_model.scaffold6_size143996-snap.2;SNAP

#EVM prediction: Mode:STANDARD S-ratio: 1.00 20968-21183 orient(+) score(432.00)
20968   21183   single+ 1   3   {GeneID_mRNA_scaffold6_size143996_6;GeneID}

#EVM prediction: Mode:STANDARD S-ratio: 1.00 21940-22362 orient(-) score(846.00)
22362   21940   single- 4   6   {GeneID_mRNA_scaffold6_size143996_7;GeneID}

#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
33363   33495   initial+    1   1   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}
33496   33611   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33612   33741   internal+   2   2   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33742   33842   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33843   34677   terminal+   3   3   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}

#EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36) 
46879   46394   terminal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
47512   46880   INTRON          {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48256   47513   internal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48366   48257   INTRON          {Augustus_model.g41.t1;Augustus}
48429   48367   internal-   4   6   {Augustus_model.g41.t1;Augustus}
48510   48430   INTRON          {Augustus_model.g41.t1;Augustus}
48564   48511   initial-    4   6   {Augustus_model.g41.t1;Augustus}

Now I want to extract the records with a score greater than 1000. That means removing the second and third records, which have score(432.00) and score(846.00).

I have written this awk code:

awk -F '[()]' '{if ($4 > 1000) print $0}' input.out

but it prints only the header line of each matching record, i.e.

#EVM predictionEVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
#EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36)

But I want to extract the complete record for every score greater than 1000. Please help me extract the complete records.

anubhava · Accepted Answer

You may use this awk with an empty RS and match function:

awk -v RS= 'match($0, /score\([^)]+\)/) && substr($0, RSTART+6, RLENGTH-7)+0 > 1000 {ORS = RT; print}' file

#EVM predictionEVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
11477   11043   single- 4   6   {SNAP_model.scaffold6_size143996-snap.2;SNAP

#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
33363   33495   initial+    1   1   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}
33496   33611   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33612   33741   internal+   2   2   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33742   33842   INTRON          {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus},{ev_type:GeMoMa/ID=model.scaffold6_size143996.rna-XM_007036272.2_R0;GeMoMa}
33843   34677   terminal+   3   3   {SNAP_model.scaffold6_size143996-snap.3;SNAP},{GeneID_mRNA_scaffold6_size143996_10;GeneID},{Augustus_model.g38.t1;Augustus}

#EVM prediction: Mode:STANDARD S-ratio: 2.41 46394-48564 orient(-) score(9677.00) noncoding_equivalent(4012.03) raw_noncoding(7194.39) offset(3182.36)
46879   46394   terminal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
47512   46880   INTRON          {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48256   47513   internal-   4   6   {GeneID_mRNA_scaffold6_size143996_13;GeneID}
48366   48257   INTRON          {Augustus_model.g41.t1;Augustus}
48429   48367   internal-   4   6   {Augustus_model.g41.t1;Augustus}
48510   48430   INTRON          {Augustus_model.g41.t1;Augustus}
48564   48511   initial-    4   6   {Augustus_model.g41.t1;Augustus}

A more readable version:

awk -v RS= '
match($0, /score\([^)]+\)/) && substr($0, RSTART+6, RLENGTH-7)+0 > 1000 {
   ORS = RT
   print
}' file
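
RT is a gawk extension, so the commands above need GNU awk. As a rough sketch for other awks, assuming it is acceptable to always write exactly one blank line after each printed record, ORS can be set to two newlines instead. The sample records below are shortened, made-up stand-ins for the real file, and the variable names tmp and result are only for illustration:

```shell
# Hypothetical sample data: three blank-line-separated records
# with scores 1246.00, 432.00 and 21500.00.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
#EVM prediction: Mode:STANDARD S-ratio: 2.52 11043-11477 orient(-) score(1246.00)
11477   11043   single- 4   6   {SNAP_model.example;SNAP}

#EVM prediction: Mode:STANDARD S-ratio: 1.00 20968-21183 orient(+) score(432.00)
20968   21183   single+ 1   3   {GeneID_mRNA_example;GeneID}

#EVM prediction: Mode:STANDARD S-ratio: 12.32 33363-34677 orient(+) score(21500.00)
33363   33495   initial+    1   1   {Augustus_model.example;Augustus}
EOF

# RS= switches awk to paragraph mode: each blank-line-separated block is one record.
# ORS='\n\n' writes a blank line after every printed record, so RT is not needed.
result=$(awk -v RS= -v ORS='\n\n' '
match($0, /score\([^)]+\)/) {
    # "score(" is 6 characters long; together with the closing ")" that is
    # 7 characters to drop from the matched text, leaving only the number.
    score = substr($0, RSTART + 6, RLENGTH - 7) + 0
    if (score > 1000) print
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

With the real data, the same awk program can be run directly on the file in place of "$tmp". Only the first score(...) in each record is examined, which is the one on the header line.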

Tags: python, shell, sed, awk

Santhosh Hegde
