I want to remove duplicate words/strings from a large tab-separated file using Linux commands.
names john, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick
cities san jose, santa clara, san franscisco, new york, san jose, santa clara
The above is the file format; I want to retain the tabs and commas after removing the duplicate words.
names john, cnn, mac, tommy, patrick, ngc, discovery, adam
cities san jose, santa clara, san franscisco, new york
Any help would be appreciated.
awk 'BEGIN {
    FS = ", |\t"
}
{
    printf "%s\t", $1
    delim = ""
    for (i = 2; i <= NF; i++) {
        if (!($i in seen)) {
            printf "%s%s", delim, $i
            delim = ", "
        }
        seen[$i]
    }
    printf "\n"
    delete seen
}' inputfile
If you're not using GNU Awk (gawk), deleting a whole array with "delete seen" may not be supported; use split("", seen) instead.
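For reference, here is a self-contained run of the portable variant on the sample names line, with split("", seen) standing in for "delete seen" (the printf stands in for one line of the input file):

```shell
# Self-contained demo of the awk approach; split("", seen) empties the
# array in any POSIX awk, not just gawk.
printf 'names\tjohn, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick\n' |
awk 'BEGIN {
    FS = ", |\t"
}
{
    printf "%s\t", $1
    delim = ""
    for (i = 2; i <= NF; i++) {
        if (!($i in seen)) {
            printf "%s%s", delim, $i
            delim = ", "
        }
        seen[$i]
    }
    printf "\n"
    split("", seen)    # portable replacement for: delete seen
}'
```

This prints the names line with the duplicates removed while keeping the tab after the label and the ", " separators between items.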
sed and awk by themselves aren't particularly well suited for this. uniq is better.
First pull the names out into another file, say names. You can use sed for this (the POSIX [[:space:]] class is used instead of GNU sed's \s, and the g flag is unnecessary with an anchored pattern):
head -1 inputfile | sed 's/^names[[:space:]]*//' > names
So now names contains john, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick.
Then use this (the gsub trims the blank after each comma and the trailing newline, so duplicate words compare equal):
awk 'BEGIN{RS=","}{gsub(/^[ \t\n]+|[ \t\n]+$/, ""); print}' names | sort | uniq | awk 'BEGIN{ORS=","}{print $0}'
The output is adam,cnn,discovery,john,mac,ngc,patrick,tommy, — note the trailing comma, which you can strip with sed if you want. Of course, you can also pipe the output of the head command straight into this pipeline; in that case you won't need the intermediate names file.
The same approach works for cities. I am assuming order is not important to you.
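If you'd rather avoid both the trailing comma and the intermediate file, a sketch using sort -u and paste (standard coreutils) does the de-duplication and rejoining in one pipeline; the printf below stands in for head -1 inputfile:

```shell
# Alternative pipeline: sort -u de-duplicates, paste rejoins the items,
# and no trailing comma is produced.
printf 'names\tjohn, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick\n' |
sed 's/^names[[:space:]]*//' |   # drop the label and the tab after it
tr ',' '\n' |                    # put one item on each line
sed 's/^ *//' |                  # trim the space that followed each comma
sort -u |                        # sort and remove duplicates in one step
paste -sd, -                     # rejoin into a single comma-separated line
```

Like the sort | uniq version, this loses the original order; multi-word items such as "san jose" survive intact because only leading spaces are trimmed.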