I want to remove duplicate words/strings from a large tab-separated file using Linux commands.
names john, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick
cities san jose, santa clara, san franscisco, new york, san jose, santa clara
The above is the file format; I want to retain the tabs and commas after removing the duplicate words.
names john, cnn, mac, tommy, patrick, ngc, discovery, adam
cities san jose, santa clara, san franscisco, new york
Any help would be appreciated.
awk 'BEGIN {
    FS = ", |\t"
}
{
    printf "%s\t", $1
    delim = ""
    for (i = 2; i <= NF; i++) {
        if (!($i in seen)) {
            printf "%s%s", delim, $i
            delim = ", "
        }
        seen[$i]
    }
    printf "\n"
    delete seen
}' inputfile
If you're not using GNU Awk (gawk), deleting a whole array with "delete seen" may not be supported; use split("", seen) instead.
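For reference, here is a self-contained run of the portable variant on the sample names line, with split("", seen) standing in for "delete seen" (the printf stands in for one line of the input file):

```shell
# Self-contained demo of the awk approach; split("", seen) empties the
# array in any POSIX awk, not just gawk.
printf 'names\tjohn, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick\n' |
awk 'BEGIN {
    FS = ", |\t"
}
{
    printf "%s\t", $1
    delim = ""
    for (i = 2; i <= NF; i++) {
        if (!($i in seen)) {
            printf "%s%s", delim, $i
            delim = ", "
        }
        seen[$i]
    }
    printf "\n"
    split("", seen)    # portable replacement for: delete seen
}'
```

This prints the names line with the duplicates removed while keeping the tab after the label and the ", " separators between items.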
sed and awk by themselves aren't particularly well suited for this. uniq is better.
First pull the names out into another file, say names. You can use sed for this (the POSIX [[:space:]] class is used instead of GNU sed's \s, and the g flag is unnecessary with an anchored pattern):
head -1 inputfile | sed 's/^names[[:space:]]*//' > names
So now names contains john, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick.
Then use this (the gsub trims the blank after each comma and the trailing newline, so duplicate words compare equal):
awk 'BEGIN{RS=","}{gsub(/^[ \t\n]+|[ \t\n]+$/, ""); print}' names | sort | uniq | awk 'BEGIN{ORS=","}{print $0}'
The output is adam,cnn,discovery,john,mac,ngc,patrick,tommy, — note the trailing comma, which you can strip with sed if you want. Of course, you can also pipe the output of the head command straight into this pipeline; in that case you won't need the intermediate names file.
The same approach works for cities. I am assuming order is not important to you.
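If you'd rather avoid both the trailing comma and the intermediate file, a sketch using sort -u and paste (standard coreutils) does the de-duplication and rejoining in one pipeline; the printf below stands in for head -1 inputfile:

```shell
# Alternative pipeline: sort -u de-duplicates, paste rejoins the items,
# and no trailing comma is produced.
printf 'names\tjohn, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick\n' |
sed 's/^names[[:space:]]*//' |   # drop the label and the tab after it
tr ',' '\n' |                    # put one item on each line
sed 's/^ *//' |                  # trim the space that followed each comma
sort -u |                        # sort and remove duplicates in one step
paste -sd, -                     # rejoin into a single comma-separated line
```

Like the sort | uniq version, this loses the original order; multi-word items such as "san jose" survive intact because only leading spaces are trimmed.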