Uniq in awk; removing duplicate values in a column using awk

Tags: bash, unique, awk

I have a large data file in the following format:

ENST00000371026 WDR78,WDR78,WDR78,  WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458,  atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,

The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:

ENST00000371026 WDR78   WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458   atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,

I tried the following code, but it doesn't seem to remove the duplicate values.

awk ' 
BEGIN { FS="\t" } ;
{
  split($2, valueArray,",");
  j=0;
  for (i in valueArray) 
  { 
    if (!( valueArray[i] in duplicateArray))
    {
      duplicateArray[j] = valueArray[i];
      j++;
    }
  };
  printf $1 "\t";
  for (j in duplicateArray) 
  {
    if (duplicateArray[j]) {
      printf duplicateArray[j] ",";
    }
  }
  printf "\t";
  print $3

}' knownGeneFromUCSC.txt

How can I remove the duplicates in column 2 correctly?

asked Jun 04 '10 23:06

D W

1 Answer

Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.

The in operator checks for the presence of an index, not a value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This avoids having to iterate over both arrays in a nested loop.
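To illustrate the point about in, here is a minimal sketch. The function name in_demo and the array names seen and byValue are hypothetical and used only for this demonstration. It stores the same string once as a value and once as an index, then tests both with in:

```shell
# in_demo is a hypothetical helper name used only for this illustration.
in_demo() {
  awk 'BEGIN {
    seen[0] = "WDR78"       # string stored as a VALUE under index 0, as in the question
    byValue["WDR78"] = 1    # string used as the INDEX, as in the answer
    # "in" looks only at indices, so the first test fails and the second succeeds
    print (("WDR78" in seen) ? "found" : "missing")
    print (("WDR78" in byValue) ? "found" : "missing")
  }'
}
in_demo
```

The first line printed is "missing" and the second is "found", which is why the question's test never matched anything.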

The split call sees "WDR78,WDR78,WDR78," as four fields rather than three because of the trailing comma, so I added an if to skip the resulting empty value. Without that if, an extra comma would be printed, giving ",WDR78,".
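The trailing-comma behaviour can be checked directly. This is a small sketch, with split_demo and parts as hypothetical names, that prints how many pieces split returns for that string and what the last piece contains:

```shell
# split_demo is a hypothetical helper name used only for this illustration.
split_demo() {
  printf '%s\n' 'WDR78,WDR78,WDR78,' |
  awk '{
    n = split($0, parts, ",")    # split returns the number of pieces it created
    printf "%d pieces, last piece is \"%s\"\n", n, parts[n]
  }'
}
split_demo
```

It reports 4 pieces, with the last one being the empty string that follows the final comma.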

* In reality, all arrays in AWK are associative.

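awk '
BEGIN { FS = "\t" }
{
  split($2, valueArray, ",")
  # record each value as an index; repeated values collapse into one entry
  for (i in valueArray)
  {
    if (!(valueArray[i] in duplicateArray))
    {
      duplicateArray[valueArray[i]] = 1
    }
  }
  printf "%s\t", $1    # explicit format, so a % in the data is harmless
  for (j in duplicateArray)    # traversal order is unspecified
  {
    if (j != "")    # skips the empty value left by the trailing comma
    {
      printf "%s,", j
    }
  }
  printf "\t"
  print $3
  delete duplicateArray    # for non-gawk, use split("", duplicateArray)
}' knownGeneFromUCSC.txt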
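The script above prints a trailing comma after the last value, and for (j in duplicateArray) does not guarantee the original order. If the exact output in the question is wanted (RERE,KIAA0458 with no trailing comma), one possible variant walks the pieces by numeric index and joins them by hand. This is a sketch rather than part of the original answer; the function name dedupe_col2 and the variables values, seen and out are hypothetical:

```shell
# dedupe_col2 is a hypothetical wrapper; pass file names as arguments or pipe data in.
dedupe_col2() {
  awk '
  BEGIN { FS = OFS = "\t" }
  NF < 2 { print; next }            # leave lines without a second column untouched
  {
    n = split($2, values, ",")
    out = ""
    for (i = 1; i <= n; i++) {      # numeric loop keeps the original order
      v = values[i]
      if (v != "" && !(v in seen)) {
        seen[v] = 1
        out = (out == "" ? v : out "," v)   # join without a trailing comma
      }
    }
    $2 = out                        # assigning a field rebuilds the record using OFS
    print
    split("", seen)                 # portable way to empty the array for the next record
  }' "$@"
}
```

It could be called as dedupe_col2 knownGeneFromUCSC.txt. Only the second column is rewritten, so the comma-separated descriptions in the third column pass through unchanged.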

answered Oct 14 '22 10:10

Dennis Williamson
