 

How do I find and remove duplicate lines from a file using Regular Expressions? [closed]

Tags:

regex

This question is meant to be language agnostic. Using only Regular Expressions, can I find and replace duplicate lines in a file?

Please consider the following example input and the output that I want:

Input>>

    11
    22
    22  <-duplicate
    33
    44
    44  <-duplicate
    55

Output>>

    11
    22
    33
    44
    55
ebattulga, asked Oct 15 '09

People also ask

How do I remove duplicate lines in files?

The uniq command is used to remove duplicate lines from a text file in Linux. By default, this command discards all but the first of adjacent repeated lines, so that no output lines are repeated. Optionally, it can instead only print duplicate lines.

How do I find duplicate lines in a text file in Linux?

The uniq command in Linux can also be used to find duplicate lines in a text file: uniq -d prints one copy of each line that is repeated. Since the uniq command matches adjacent lines when looking for redundant copies, it only works reliably on sorted text files.


2 Answers

Regular-expressions.info has a page on Deleting Duplicate Lines From a File

This basically boils down to searching for this one-liner:

^(.*)(\r?\n\1)+$ 

... and replacing with \1.
Note: the dot must not match newlines, so "dot matches newline" (single-line/DOTALL) mode must be off.

Explanation:

The caret will match only at the start of a line. So the regex engine will only attempt to match the remainder of the regex there. The dot and star combination simply matches an entire line, whatever its contents, if any. The parentheses store the matched line into the first backreference.

Next we will match the line separator. I put the question mark into \r?\n to make this regex work with both Windows (\r\n) and UNIX (\n) text files. So up to this point we matched a line and the following line break.

Now we need to check if this combination is followed by a duplicate of that same line. We do this simply with \1. This is the first backreference which holds the line we matched. The backreference will match that very same text.

If the backreference fails to match, the regex match and the backreference are discarded, and the regex engine tries again at the start of the next line. If the backreference succeeds, the plus symbol in the regular expression will try to match additional copies of the line. Finally, the dollar symbol forces the regex engine to check if the text matched by the backreference is a complete line. We already know the text matched by the backreference is preceded by a line break (matched by \r?\n). Therefore, we now check if it is also followed by a line break or if it is at the end of the file using the dollar sign.

The entire match becomes line\nline (or line\nline\nline etc.). Because we are doing a search and replace, the line, its duplicates, and the line breaks in between them, are all deleted from the file. Since we want to keep the original line, but not the duplicates, we use \1 as the replacement text to put the original line back in.
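The search-and-replace described above can be sketched in Python (a minimal illustration, not part of the original answer; re.MULTILINE makes ^ and $ match at line boundaries, and DOTALL is deliberately left off so the dot does not match newlines):

```python
import re

# The question's sample input, one number per line.
text = "11\n22\n22\n33\n44\n44\n55"

# ^(.*)(\r?\n\1)+$ : a line, then one or more exact repeats of it,
# each preceded by a Windows (\r\n) or UNIX (\n) line break.
deduped = re.sub(r"^(.*)(\r?\n\1)+$", r"\1", text, flags=re.MULTILINE)

print(deduped)  # 11, 22, 33, 44, 55 -- one per line, duplicates gone
```

Because the (\r?\n\1)+ group is repeated, a whole run of identical adjacent lines collapses to a single copy in one match.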

Ben James, answered Sep 28 '22


See my request for more info in the comments; for now, I'm answering the easy way.

  1. If the order doesn't matter, just a

    sort -u

    will do the trick

  2. If the order does matter but you don't mind running multiple passes (this is vim syntax), you can use:

    %s/\(.*\)\(\_.*\)\(\1\)/\2\1/g

    to preserve the last occurrence, or

    %s/\(.*\)\(\_.*\)\(\1\)/\1\2/g

    to preserve the first occurrence.
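The multi-pass, order-preserving approach can be sketched in Python (an illustrative translation, not Davide's exact vim command; the dedupe_keep_first helper name is hypothetical). It anchors the captured line with ^ and $ so one line cannot match a mere substring of another, and repeats the substitution until a pass changes nothing:

```python
import re

def dedupe_keep_first(text):
    # ^(.*)$  : capture a whole line
    # [\s\S]* : any text in between, newlines included
    # \n^\1$  : a later line that is an exact duplicate
    pattern = re.compile(r"^(.*)$([\s\S]*)\n^\1$", re.MULTILINE)
    while True:
        # Drop the later duplicate, keep the first occurrence,
        # then rerun until no further match is found.
        new = pattern.sub(r"\1\2", text, count=1)
        if new == text:
            return text
        text = new

print(dedupe_keep_first("11\n22\n33\n22\n11"))  # 11, 22, 33
```

Swapping the replacement to put the captured line at the end instead would preserve the last occurrence, mirroring the two vim commands above.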

If you do mind running multiple passes, then it's more difficult, so before we work on that, please say so in the question!

EDIT: your edit wasn't very clear, but it looks like you want single-pass removal of duplicate ADJACENT lines! Well, that's much easier!

A simple:

^(.*)(\n\1)+$

(%s/^\(.*\)\(\n\1\)\+$/\1/ in vim) i.e. searching for ^(.*)(\n\1)+$ and replacing it with just \1 will do the trick. (The line separator has to be matched before the backreference; a bare (.*)\1* would never cross the newline.)
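A quick Python sketch of the single-pass, adjacent-only case (illustrative only; note the newline is matched explicitly before the backreference, which most engines require for the repeat to reach the next line):

```python
import re

# A run of three identical adjacent lines.
text = "11\n22\n22\n22\n33"

# (\n\1)+ consumes every extra adjacent copy in a single match.
result = re.sub(r"^(.*)(\n\1)+$", r"\1", text, flags=re.MULTILINE)

print(result)  # 11, 22, 33 -- the whole run collapsed in one pass
```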

Davide, answered Sep 28 '22