
Remove duplicate words from each line of a file in bash

Tags: bash, sed, awk

I would like to know how I can remove duplicate words from every line of a file in bash, using sed, awk, etc.

I have a file with 2000 lines, and I would like to keep only one occurrence of each word on every line:

OG0000005 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373  K00373
OG0000006 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374  K00374
OG0000007 K03089  K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089
OG0000008 K15554  K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599
OG0000009 K15555  K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555
OG0000010 K00817  K09758 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817
OG0000011 K07267  K07267  K07267  K07267 K07267 K07267 K07267 K07267 K07267 K07267 K07267 K07267 K07267 K07267 K07267 K07267
OG0000012 K22397  K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714
OG0000013 K00370  K07812 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370

So the output should look like:

OG0000005 K00373
OG0000006 K00374
OG0000007 K03089  
OG0000008 K15554  K15599 
OG0000009 K15555 
OG0000010 K00817  K09758

I tried with:

sort file | uniq

while read line
do
sort && uniq
done < file
Juliana B C asked Jan 25 '23 05:01


2 Answers

You may use this awk solution:

awk '
{
   delete seen
   printf "%s", $1
   for (i=2; i<=NF; ++i)
      if (!seen[$i]++)
         printf "%s", OFS $i
   print ""
}' file

Output:

OG0000005 K00373
OG0000006 K00374
OG0000007 K03089
OG0000008 K15554 K15599
OG0000009 K15555
OG0000010 K00817 K09758
OG0000011 K07267
OG0000012 K22397 K01714
OG0000013 K00370 K07812
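
For awk implementations that do not accept `delete seen` on a whole array, the same idea can be sketched with `split("", seen)`, which empties the array portably. This is only a variation on the answer above, not a replacement for it; the function name `dedupe_words` is hypothetical and chosen here for illustration.

```shell
#!/usr/bin/env bash
# dedupe_words: hypothetical helper name (an assumption, not from the answer).
# Reads the named files, or stdin when no file is given.
dedupe_words() {
  awk '
  {
     split("", seen)          # empty the array at the start of every line
     out = $1                 # always keep the first field (the OG id)
     for (i = 2; i <= NF; ++i)
        if (!seen[$i]++)      # true only the first time a word appears on this line
           out = out OFS $i
     print out                # fields are rejoined with OFS (one space by default)
  }' "$@"
}

# Example on one of the sample lines from the question:
printf '%s\n' 'OG0000010 K00817  K09758 K00817 K00817' | dedupe_words
```

For the sample line this prints `OG0000010 K00817 K09758`. Because the fields are rejoined with OFS, runs of spaces in the input collapse to single spaces in the output.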
anubhava answered Jan 26 '23 19:01


This might work for you (GNU sed):

sed -E ':a;s/(( +\S+)\>.*)\2\>/\1/;ta' file

Match a string that begins with a word which is repeated later on the line, and replace it with the original string minus the repeated word.

Repeat until no further substitution succeeds.
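
To see the loop at work, the command can be wrapped in a small function and run on a single line. This is a minimal sketch assuming GNU sed (`\S` and `\>` are GNU extensions); the function name `dedupe_sed` is hypothetical. Since `.*` is greedy, each pass should drop the last repeat of the first duplicated word, and `ta` jumps back to `:a` as long as a substitution was made.

```shell
#!/usr/bin/env bash
# dedupe_sed: hypothetical wrapper name (an assumption, not from the answer).
# Requires GNU sed; reads the named files, or stdin when no file is given.
dedupe_sed() {
  # :a        label marking the start of the loop
  # s/.../    group 2 is a space-prefixed word; \2 must occur again later on the line
  # \1        keeps everything up to, but not including, that later repeat
  # ta        branch back to :a if the substitution changed the line
  sed -E ':a;s/(( +\S+)\>.*)\2\>/\1/;ta' "$@"
}

# Example with single-space separators:
printf '%s\n' 'OG0000008 K15554 K15599 K15599 K15599' | dedupe_sed
```

For this input the result is `OG0000008 K15554 K15599`. Unlike the awk answer, the sed command does not normalize whitespace, so lines containing runs of spaces may keep extra or trailing spaces.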

potong answered Jan 26 '23 18:01