 

How do I find and remove duplicate lines from a file using Regular Expressions? [closed]

Tags:

regex

This question is meant to be language agnostic. Using only Regular Expressions, can I find and replace duplicate lines in a file?

Please consider the following example input and the output that I want:

Input>>

    11
    22
    22  <-duplicate
    33
    44
    44  <-duplicate
    55

Output>>

    11
    22
    33
    44
    55
ebattulga, asked Oct 15 '09

People also ask

How do I remove duplicate lines in files?

The uniq command is used to remove duplicate lines from a text file in Linux. By default, this command discards all but the first of adjacent repeated lines, so that no output lines are repeated. Optionally, it can instead only print duplicate lines.

How do I find duplicate lines in a text file in Linux?

The uniq command in Linux can also be used to find duplicate lines in a text file: uniq -d prints one copy of each line that is repeated. Since the uniq command matches adjacent lines when looking for redundant copies, it only works reliably on sorted text files.


2 Answers

Regular-expressions.info has a page on Deleting Duplicate Lines From a File

This basically boils down to searching for this one-liner:

^(.*)(\r?\n\1)+$ 

... and replacing with \1.
Note: the dot must not match newlines, so "dot matches newline" (single-line/DOTALL) mode must be off.

Explanation:

The caret will match only at the start of a line. So the regex engine will only attempt to match the remainder of the regex there. The dot and star combination simply matches an entire line, whatever its contents, if any. The parentheses store the matched line into the first backreference.

Next we will match the line separator. I put the question mark into \r?\n to make this regex work with both Windows (\r\n) and UNIX (\n) text files. So up to this point we matched a line and the following line break.

Now we need to check if this combination is followed by a duplicate of that same line. We do this simply with \1. This is the first backreference which holds the line we matched. The backreference will match that very same text.

If the backreference fails to match, the regex match and the backreference are discarded, and the regex engine tries again at the start of the next line. If the backreference succeeds, the plus symbol in the regular expression will try to match additional copies of the line. Finally, the dollar symbol forces the regex engine to check if the text matched by the backreference is a complete line. We already know the text matched by the backreference is preceded by a line break (matched by \r?\n). Therefore, we now check if it is also followed by a line break or if it is at the end of the file using the dollar sign.

The entire match becomes line\nline (or line\nline\nline etc.). Because we are doing a search and replace, the line, its duplicates, and the line breaks in between them, are all deleted from the file. Since we want to keep the original line, but not the duplicates, we use \1 as the replacement text to put the original line back in.
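The search-and-replace described above can be sketched in Python (a minimal illustration, not part of the original answer; re.MULTILINE makes ^ and $ match at line boundaries, and DOTALL is deliberately left off so the dot does not match newlines):

```python
import re

# The question's sample input, one number per line.
text = "11\n22\n22\n33\n44\n44\n55"

# ^(.*)(\r?\n\1)+$ : a line, then one or more exact repeats of it,
# each preceded by a Windows (\r\n) or UNIX (\n) line break.
deduped = re.sub(r"^(.*)(\r?\n\1)+$", r"\1", text, flags=re.MULTILINE)

print(deduped)  # 11, 22, 33, 44, 55 -- one per line, duplicates gone
```

Because the (\r?\n\1)+ group is repeated, a whole run of identical adjacent lines collapses to a single copy in one match.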

Ben James, answered Sep 28 '22


See my request for more info in the comments; for now, I'm answering the easy way.

  1. If the order doesn't matter, just a

    sort -u

    will do the trick

  2. If the order does matter but you don't mind running multiple passes (this is vim syntax), you can use:

    %s/\(.*\)\(\_.*\)\(\1\)/\2\1/g

    to preserve the last occurrence, or

    %s/\(.*\)\(\_.*\)\(\1\)/\1\2/g

    to preserve the first occurrence.
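The multi-pass, order-preserving approach can be sketched in Python (an illustrative translation, not Davide's exact vim command; the dedupe_keep_first helper name is hypothetical). It anchors the captured line with ^ and $ so one line cannot match a mere substring of another, and repeats the substitution until a pass changes nothing:

```python
import re

def dedupe_keep_first(text):
    # ^(.*)$  : capture a whole line
    # [\s\S]* : any text in between, newlines included
    # \n^\1$  : a later line that is an exact duplicate
    pattern = re.compile(r"^(.*)$([\s\S]*)\n^\1$", re.MULTILINE)
    while True:
        # Drop the later duplicate, keep the first occurrence,
        # then rerun until no further match is found.
        new = pattern.sub(r"\1\2", text, count=1)
        if new == text:
            return text
        text = new

print(dedupe_keep_first("11\n22\n33\n22\n11"))  # 11, 22, 33
```

Swapping the replacement to put the captured line at the end instead would preserve the last occurrence, mirroring the two vim commands above.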

If you do mind running multiple passes, then it's more difficult, so before we work on that, please say so in the question!

EDIT: your edit wasn't very clear, but it looks like you want single-pass removal of duplicate ADJACENT lines! Well, that's much easier!

A simple:

^(.*)(\n\1)+$

(%s/^\(.*\)\(\n\1\)\+$/\1/ in vim) i.e. searching for ^(.*)(\n\1)+$ and replacing it with just \1 will do the trick. (The line separator has to be matched before the backreference; a bare (.*)\1* would never cross the newline.)
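A quick Python sketch of the single-pass, adjacent-only case (illustrative only; note the newline is matched explicitly before the backreference, which most engines require for the repeat to reach the next line):

```python
import re

# A run of three identical adjacent lines.
text = "11\n22\n22\n22\n33"

# (\n\1)+ consumes every extra adjacent copy in a single match.
result = re.sub(r"^(.*)(\n\1)+$", r"\1", text, flags=re.MULTILINE)

print(result)  # 11, 22, 33 -- the whole run collapsed in one pass
```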

Davide, answered Sep 28 '22