I have a tab-delimited file which looks like this:
chr1 12226559 12227059 TNFRSF1B
chr1 17051560 17052060
chr1 17053279 17053779
chr1 17338423 17338923 ATP13A2
ATP13A2
ATP13A2
chr1 19577574 19578074 EMC1
MRTO4
chr1 19578046 19578546 EMC1
MRTO4
chr1 19638239 19638739 AKR7A2
PQLC2
PQLC2
PQLC2
AKR7A2
PQLC2
I want to remove the lines where the value in column 4 is repeated.
The first three columns are coordinates, and whatever is found at those coordinates is listed in column 4. For each coordinate I want only the unique names, with no repetition of names.
I want output like this:
chr1 12226559 12227059 TNFRSF1B
chr1 17051560 17052060
chr1 17053279 17053779
chr1 17338423 17338923 ATP13A2
chr1 19577574 19578074 EMC1
MRTO4
chr1 19578046 19578546 EMC1
MRTO4
chr1 19638239 19638739 AKR7A2
PQLC2
Things that I have tried:
sort -k 4 -u file
awk '{if($4==temp1){next;}else{print}temp1=$4}' file
Nothing works :(
Please help
Thank you
You just need:
awk '$NF != prev {print} {prev=$NF}' file
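As a quick check of what this one-liner does, here is a small sketch that feeds it only the names listed under the last coordinate in the question. It drops a name only when it repeats on the very next line, so a name that comes back later in the same block is printed again. The variable name `out` is just for illustration.

```shell
# Names as they appear under chr1 19638239 19638739 in the question.
out=$(printf '%s\n' AKR7A2 PQLC2 PQLC2 PQLC2 AKR7A2 PQLC2 |
  awk '$NF != prev {print} {prev=$NF}')

# Only adjacent repeats are removed, so four lines remain:
# AKR7A2, PQLC2, AKR7A2, PQLC2
printf '%s\n' "$out"
```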
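EDIT: to handle the new input, where a name can reappear non-consecutively under the same coordinates, remember which names have already been printed for each coordinate: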
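awk '{
    if (NF == 1)
        value = $1                      # continuation line: it holds only a name
    else {
        key = $1 SUBSEP $2 SUBSEP $3    # coordinate line: remember the coordinates
        value = $4                      # empty when the line has no name
    }
    if ((key SUBSEP value) in val)      # name already printed for these coordinates
        next
    print
    val[key, value] = 1
}' input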
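If each set of coordinates appears only once in the file, a shorter variant is possible. The following is a sketch under that assumption, not a replacement for the script above: it forgets the names seen so far whenever a new coordinate line starts, instead of keying them by coordinate. The function name `dedup_names` and the temporary file are hypothetical, and the sample is a slice of the data from the question with whitespace-separated fields.

```shell
# Recreate part of the sample input (assumed layout: continuation lines carry only a name).
input=$(mktemp)
printf '%s\n' \
  'chr1 17338423 17338923 ATP13A2' \
  'ATP13A2' \
  'ATP13A2' \
  'chr1 19638239 19638739 AKR7A2' \
  'PQLC2' \
  'PQLC2' \
  'AKR7A2' \
  'PQLC2' > "$input"

# Hypothetical helper: print each name once per coordinate block.
dedup_names() {
  awk '
    NF > 1 { split("", seen) }          # coordinate line: clear the names of the previous block
    { name = (NF == 1) ? $1 : $4 }      # name is the only field, or the fourth one
    !seen[name]++                       # print the line the first time this name shows up
  ' "$1"
}

result=$(dedup_names "$input")
printf '%s\n' "$result"
rm -f "$input"
```

Unlike the script above, this keeps only the names of the current block in memory, but it would print a name again if the same coordinates showed up a second time later in the file.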