
grep a string and return the line with the highest value, or a message if there are ties

Tags:

bash

I have a set of ASVs (used like keys) taken from a different file, and for each ASV I want to return the matching line with the highest value. Sometimes several lines tie for the highest value, which is a problem, so in that case I want to return a message instead.

I am able to return the line with the highest value after sorting, but I don't know how to account for ties properly.

Here is a snippet of my file called file.txt:

ASV,Kingdom,Phylum,Class,Order,Family,Genus,Species,Hits
29ec61e470705074f483368a70ad18a7,Bacteria,???,???,???,???,???,uncultured bacterium,5
29ec61e470705074f483368a70ad18a7,Bacteria,Chloroflexota,Anaerolineae,???,???,???,uncultured Anaerolineae,2
29ec61e470705074f483368a70ad18a7,Bacteria,Chloroflexota,???,???,???,???,uncultured Chloroflexota,1
29ec61e470705074f483368a70ad18a7,Bacteria,???,???,???,???,???,unidentified marine,1
29ec61e470705074f483368a70ad18a7,Bacteria,Chloroflexota,Chloroflexia,Chloroflexales,Chloroflexaceae,Chloroflexus,uncultured Chloroflexus,1
74627d6dc445e8b5f46a787cf81c4294,Bacteria,Pseudomonadota,Gammaproteobacteria,Legionellales,Legionellaceae,???,uncultured Legionellaceae,2
74627d6dc445e8b5f46a787cf81c4294,Bacteria,???,???,???,???,???,uncultured bacterium,5
74627d6dc445e8b5f46a787cf81c4294,Bacteria,Pseudomonadota,Gammaproteobacteria,Legionellales,Legionellaceae,Legionella,Legionella sp.,3
55b1bec5f8dbe1b58007aee7ede9bae3,Bacteria,Bacteroidota,Cytophagia,Cytophagales,Cytophagaceae,Spirosoma,Spirosoma utsteinense,2
55b1bec5f8dbe1b58007aee7ede9bae3,Bacteria,???,???,???,???,???,uncultured bacterium,2
55b1bec5f8dbe1b58007aee7ede9bae3,Bacteria,Bacteroidota,Cytophagia,Cytophagales,Cytophagaceae,Spirosoma,Spirosoma sp.,2
55b1bec5f8dbe1b58007aee7ede9bae3,Bacteria,Bacteroidota,Cytophagia,Cytophagales,Cytophagaceae,Spirosoma,Spirosoma rigui,6
8964b7d833654ceedbdb6f6f25fb7d6a,Bacteria,???,???,???,???,???,uncultured bacterium,8
8964b7d833654ceedbdb6f6f25fb7d6a,Bacteria,Bacillota,Tissierellia,Tissierellales,Peptoniphilaceae,Finegoldia,Finegoldia magna,1
8964b7d833654ceedbdb6f6f25fb7d6a,???,???,???,???,???,???,uncultured organism,1
9966f0e6e452c31de46d030bab01fdd9,Bacteria,Bacteroidota,Sphingobacteriia,Sphingobacteriales,???,???,uncultured Cytophagales,2
9966f0e6e452c31de46d030bab01fdd9,Bacteria,Bacteroidota,Cytophagia,Cytophagales,Cytophagaceae,Spirosoma,Spirosoma jeollabukense,2
9966f0e6e452c31de46d030bab01fdd9,Bacteria,Bacteroidota,Cytophagia,Cytophagales,Cytophagaceae,Spirosoma,Spirosoma migulaei,2
9966f0e6e452c31de46d030bab01fdd9,Bacteria,Bacteroidota,Cytophagia,Cytophagales,Cytophagaceae,Spirosoma,Spirosoma sp.,1

As an example, searching for the ASV 29ec61e470705074f483368a70ad18a7 and returning the match with the highest value (the very last column) is easy:

Code:

> grep 29ec61e470705074f483368a70ad18a7 file.txt | sort -t, -nr -k9 | head -n1

# Output
29ec61e470705074f483368a70ad18a7,Bacteria,???,???,???,???,???,uncultured bacterium,5

But if I am searching an ASV such as 9966f0e6e452c31de46d030bab01fdd9, I would need it to somehow return or know that three lines can be returned (3 of them have the value 2) and instead output a message:

Ideal output:

  > grep 9966f0e6e452c31de46d030bab01fdd9 file.txt | does something

  # Output
  CHECK: There are 3 lines tied for top.
Katherine Chau, asked Nov 07 '25 06:11


1 Answer

I would opt for an awk solution:

  • eliminates need for sorting
  • eliminates overhead of subshells
  • easy to add desired logic

One awk idea:

awk -F, -v asv="29ec61e470705074f483368a70ad18a7" '
$1 == asv { if ($NF >= max_val) {
               if ($NF > max_val) {
                  delete matches
                  cnt = 0
                  max_val = $NF
               }
               matches[++cnt] = $0
            }
          }
END       { if (cnt == 0)
               print "WARNING: No lines found."
            else 
            if (cnt==1)
               print matches[cnt]
            else
               printf "CHECK: There are %s lines tied for top [max value = %s].\n", cnt, max_val
          }
' file.txt

For asv="123" this generates:

WARNING: No lines found.

For asv="29ec61e470705074f483368a70ad18a7" this generates:

29ec61e470705074f483368a70ad18a7,Bacteria,???,???,???,???,???,uncultured bacterium,5

For asv="9966f0e6e452c31de46d030bab01fdd9" this generates:

CHECK: There are 3 lines tied for top [max value = 2].
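
Since the tied lines are already stored in matches[], the same logic can also list them under the CHECK message. The following is only a sketch of that idea, written as a small shell function so it can be reused; the name max_asv_list_ties and the two-space indent are assumptions for illustration, not part of the original request:

```shell
# Hypothetical variant: same tie detection, but the tied lines are printed too.
# Usage: max_asv_list_ties <asv> <file>
max_asv_list_ties() {
awk -F, -v asv="$1" '
$1 == asv { if ($NF >= max_val) {
               if ($NF > max_val) {      # new maximum: forget earlier matches
                  delete matches
                  cnt = 0
                  max_val = $NF
               }
               matches[++cnt] = $0       # keep every line that has the current maximum
            }
          }
END       { if (cnt == 0)
               print "WARNING: No lines found."
            else if (cnt == 1)
               print matches[1]
            else {
               printf "CHECK: There are %s lines tied for top [max value = %s].\n", cnt, max_val
               for (i = 1; i <= cnt; i++)   # list the tied lines, indented
                  print "  " matches[i]
            }
          }
' "$2"
}
```

If only the message is wanted on stdout, the loop could write to stderr instead.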

Wrapping this in a bash function:

max_asv() {
awk -F, -v asv="$1" '
$1 == asv { if ($NF >= max_val) {
               if ($NF > max_val) {
                  delete matches
                  cnt = 0
                  max_val = $NF
               }
               matches[++cnt] = $0
            }
          }
END       { if (cnt == 0)
               print "WARNING: No lines found."
            else 
            if (cnt==1)
               print matches[cnt]
            else
               printf "CHECK: There are %s lines tied for top [max value = %s].\n", cnt, max_val
          }
' "$2"
}

Taking the function for a test drive:

$ max_asv 123 file.txt
WARNING: No lines found.

$ max_asv 29ec61e470705074f483368a70ad18a7 file.txt
29ec61e470705074f483368a70ad18a7,Bacteria,???,???,???,???,???,uncultured bacterium,5

$ max_asv 9966f0e6e452c31de46d030bab01fdd9 file.txt
CHECK: There are 3 lines tied for top [max value = 2].
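
The question mentions that the ASVs come from a different file. Rather than calling max_asv once per ASV, a single awk pass could handle all of them. This is a rough sketch under some assumptions: the key file holds one ASV per line and is not empty, and the names max_asv_all and asv_list.txt are made up for illustration. The messages include the ASV so the results can be told apart:

```shell
# Hypothetical helper: report the top line (or a tie message) for every ASV in a key file.
# Usage: max_asv_all asv_list.txt file.txt
max_asv_all() {
awk -F, '
# First file (the key list): remember each wanted ASV in input order.
FNR == NR {
    if ($1 != "" && !($1 in want)) {
        want[$1]
        order[++n] = $1
    }
    next
}
# Second file (the data): track the highest Hits value per ASV
# and how many lines share that value.
($1 in want) {
    hits = $NF + 0
    if (!($1 in max_val) || hits > max_val[$1]) {
        max_val[$1] = hits
        cnt[$1] = 1
        top[$1] = $0
    } else if (hits == max_val[$1]) {
        cnt[$1]++
    }
}
END {
    for (i = 1; i <= n; i++) {
        asv = order[i]
        if (!(asv in cnt))
            printf "WARNING: No lines found for %s.\n", asv
        else if (cnt[asv] == 1)
            print top[asv]
        else
            printf "CHECK: %s has %s lines tied for top [max value = %s].\n", asv, cnt[asv], max_val[asv]
    }
}
' "$1" "$2"
}
```

This reads the data file only once, no matter how many ASVs are in the key list, and prints the results in the same order as the keys.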
markp-fuso, answered Nov 09 '25 00:11


