
Remove duplicate words/string from a tab separated file

Tags: linux, sed, awk

I want to remove duplicate words/strings from a large tab separated file using Linux commands.

names            john, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick
cities            san jose, santa clara, san franscisco, new york, san jose, santa clara

The above is the file format. I want to retain the tabs and commas after removing the duplicate words, so that the result looks like this:

names            john, cnn, mac, tommy, patrick, ngc, discovery, adam
cities            san jose, santa clara, san franscisco, new york

Any help would be appreciated.

Karthik asked Feb 18 '26 06:02


2 Answers

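awk 'BEGIN {
         FS = ", |\t"    # fields are separated by ", " or by a tab
     }
     {
         printf "%s\t", $1    # print the label, then the tab
         delim = ""
         for (i = 2; i <= NF; i++) {
             if (! ($i in seen)) {    # first occurrence on this line
                 printf "%s%s", delim, $i
                 delim = ", "
             }
             seen[$i]    # referencing the element records it as seen
         }
         printf "\n"
         delete seen    # reset before the next line
     }' inputfile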

Deleting a whole array with delete seen works in GNU AWK (gawk) and several other implementations, but not in every awk. If yours rejects it, use split("", seen) instead.
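As a minimal sketch of the portable variant described above, the same logic can be written with split("", seen) instead of delete seen. The function name dedupe_fields is only a hypothetical wrapper for illustration, and the sample input is made up; it assumes each line is a label, one tab, then items separated by ", ".

```shell
# Hypothetical helper: reads tab-separated lines from stdin or from the
# files given as arguments and drops repeated items within each line.
dedupe_fields() {
    awk 'BEGIN { FS = ", |\t" }
    {
        split("", seen)          # portable way to empty the array
        out = $1 "\t"            # label followed by the tab
        delim = ""
        for (i = 2; i <= NF; i++) {
            if (!($i in seen)) {
                seen[$i]         # remember this item for the current line
                out = out delim $i
                delim = ", "
            }
        }
        print out
    }' "$@"
}

# Example with made-up input; the first occurrence of each item is kept,
# in its original order.
printf 'names\tjohn, cnn, mac, john\ncities\tsan jose, santa clara, san jose\n' | dedupe_fields
```

Because the array is emptied at the start of each record, duplicates are only detected within a line, not across lines.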

Dennis Williamson answered Feb 20 '26 19:02


sed and awk by themselves aren't particularly well suited for this. sort piped into uniq is a better fit.

First pull out the names into another file, say names. You can use sed for this:

head -n 1 inputfile | sed 's/^names[[:space:]]*//' > names

So now names contains john, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick.

Then use this:

awk 'BEGIN{RS=","}{gsub(/^[ \t\n]+|[ \t\n]+$/, ""); if ($0 != "") print $0}' names | sort | uniq | awk 'BEGIN{ORS=","}{print $0}'

Output is adam,cnn,discovery,john,mac,ngc,patrick,tommy, (with a trailing comma). You can remove the last comma with sed if you want. You can also pipe the output of the head and sed command straight into this pipeline, in which case you won't need the intermediate names file.

Do the same for cities. Note that sort reorders the items, so I am assuming order is not important for you.
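The steps above can be combined into one loop that handles every line without an intermediate file. This is only a rough sketch under a few assumptions: the function name dedupe_sorted is hypothetical, each line is a label, a tab, then comma-separated items, and tr, sort -u and paste stand in for the awk and uniq stages. It also puts the ", " separators and the tab back, and leaves no trailing comma.

```shell
# Hypothetical helper: reads "label<TAB>item, item, ..." lines from stdin
# and prints each label with its items de-duplicated and sorted.
dedupe_sorted() {
    tab=$(printf '\t')
    while IFS="$tab" read -r label items; do
        unique=$(printf '%s\n' "$items" |
            tr ',' '\n' |                                   # one item per line
            sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |    # trim blanks around items
            sort -u |                                       # same effect as sort | uniq
            paste -s -d ',' - |                             # join lines with commas
            sed 's/,/, /g')                                 # restore the ", " separator
        printf '%s\t%s\n' "$label" "$unique"
    done
}

# Example with made-up input:
printf 'names\tjohn, cnn, mac, john, adam\n' | dedupe_sorted
# prints: names<TAB>adam, cnn, john, mac
```

As with the pipeline above, the original item order is lost because of the sort.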

Hari Menon answered Feb 20 '26 19:02