
How to remove duplicate words from a plain text file using a Linux command


I have a plain text file with words, which are separated by comma, for example:

word1, word2, word3, word2, word4, word5, word3, word6, word7, word3 

I want to delete the duplicates so it becomes:

word1, word2, word3, word4, word5, word6, word7 

Any ideas? I think egrep can help me, but I'm not sure how to use it exactly.

asked Jun 04 '09 by cupakob

People also ask

How do I remove duplicates from a text file in Linux?

The uniq command is used to remove duplicate lines from a text file in Linux. By default, this command discards all but the first of adjacent repeated lines, so that no output lines are repeated. Optionally, it can instead only print duplicate lines.
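A quick sketch of that adjacency rule (the file name and words are illustrative):

```shell
# Create a sample file with adjacent and non-adjacent repeats.
printf 'apple\napple\nbanana\napple\n' > sample.txt

# uniq collapses only ADJACENT duplicates: the trailing "apple" survives.
uniq sample.txt
# apple
# banana
# apple
```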

Which command is used to remove the duplicate records in file?

The uniq command helps remove or detect duplicate entries in a file.

How do you find repeated words in Linux?

The uniq command in Linux is used to filter out repeated lines in a text file. This command can be helpful if you want to remove duplicate words or strings from a text file. Since the uniq command matches only adjacent lines when finding redundant copies, it works reliably only with sorted text files.
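Because of that adjacency requirement, running sort first makes all duplicates neighbors, so uniq removes every repeat (sample data is illustrative):

```shell
# Sorting first groups all copies of a word together,
# so uniq drops every duplicate, not just adjacent ones.
printf 'apple\napple\nbanana\napple\n' | sort | uniq
# apple
# banana
```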


1 Answer

Assuming that the words are one per line, and the file is already sorted:

uniq filename 

If the file's not sorted:

sort filename | uniq 
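An equivalent shorthand (not mentioned in the answer) is sort's -u flag, which sorts and deduplicates in one step; the sample file is illustrative:

```shell
# sort -u is equivalent to: sort words.txt | uniq
printf 'word2\nword1\nword2\n' > words.txt
sort -u words.txt
# word1
# word2
```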

If they're not one per line, and you don't mind them being one per line:

tr -s '[:space:]' '\n' < filename | sort | uniq 

That doesn't remove punctuation, though, so maybe you want:

tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq 

But that also splits hyphenated words at the hyphen. See "man tr" for more options.
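The question's input is comma-separated rather than one word per line, so a pipeline in the same spirit could split on commas, deduplicate, and rejoin. This is a sketch, not part of the answer; note that sort -u reorders the words alphabetically, which happens to match the asked-for output here:

```shell
# Split on commas, strip leading spaces, deduplicate,
# then rejoin with ", " to match the original format.
echo 'word1, word2, word3, word2, word4, word5, word3, word6, word7, word3' \
  | tr ',' '\n' | sed 's/^ *//' | sort -u | paste -s -d ',' - | sed 's/,/, /g'
# word1, word2, word3, word4, word5, word6, word7
```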

answered Oct 06 '22 by Randy Orrison