
Find and KEEP all DUPLICATE lines (instead of unique lines) in a text file

How can I identify and keep DUPLICATE, TRIPLICATE, etc. lines, i.e., all lines that occur more than once, in Notepad++? In other words, how can I delete only the unique lines?

For example, here are seven (7) separate lists and the desired true duplicate lines of each list (shown as 7 columns; regard each column as an individual list or file). The lists are shown side by side only to save space; in real life, each of the 7 lists stands alone, independent of the others, in its own file.

list1  list2  list3  list4  list5  list6  list7
1      0      0      0      0      0      0
2      1      1      1      1      1      1
3      2      2      2      2      2      2
4      3      3      3      3      3      3
4      4      4      4      4      4      4
4      4      4      4      4      4      4
5      4      4      4      4      4      4
6      5      5      5      5      5      5
7      5      5      5      5      5      5
8      6      6      6      6      6      6
9      6      6      6      6      6      6
abc    7      7      7      7      7      7
abd    8      8      8      8      8      8
abd    9      9      9      9      9      9
abe           <CR>   9      9      9      9
                            <CR>   99     99
                                          <CR>

[Lines occurring more than once in the lists above:]
4      4      4      4      4      4      4
4      4      4      4      4      4      4
4      4      4      4      4      4      4
abd    5      5      5      5      5      5
abd    5      5      5      5      5      5
       6      6      6      6      6      6
       6      6      6      6      6      6
                     9      9      9      9
                     9      9      9      9

There are many solutions to eliminate duplicates (e.g., TextFX; notepad++ delete duplicate and original lines to keep unique lines), but I cannot find a solution that keeps only the duplicates.

@Lars Fischer: The expression ((.*)\R(\2\R)+)*\K.+\R nearly works, except that the last entry of the (presorted) list needs to be a unique line followed by an empty line <CR>. One (suboptimal) workaround is to append an artificial (helper) unique line (e.g., zzz) followed by an empty line <CR> as the last two lines.

(END OF QUESTION)


UPDATE 3: This question is reposted per the Stack Overflow "ask a new question" instruction. @AdrianHHH, @B. Desai, @Paolo Forgia, @greg-449, and @Erik von Asmuth drew the incorrect conclusion that this question is a duplicate of "notepad++ delete duplicate and original lines to keep unique lines". This question is definitely not a duplicate of the one @AdrianHHH et al. cite. History.

UPDATE 2: @AdrianHHH This question is no broader (in fact, one can hardly be more specific) and no less researched than other Notepad++ questions, including https://stackoverflow.com/questions/29303148, which @AdrianHHH et al. cited (wrongly) as the same question.

UPDATE: @AdrianHHH, @B. Desai, @Paolo Forgia, @greg-449, @Erik von Asmuth This question is different from https://stackoverflow.com/questions/29303148 because Q 29303148 (i) does not ask how to identify and keep only the lines of multiple occurrence, and (ii) has no answer that provides a solution for that. Q 29303148 asks "...I just need the unique lines."

user3026965 asked Oct 13 '17 09:10




1 Answer

Here is a solution based on regular expressions and bookmarks. It works for a sorted file (i.e., each duplicated line is immediately followed by its duplicates):

  • Open the Mark dialog (Search -> Mark...)
  • Click Clear all Marks on the right
  • Check Bookmark line
  • Check Wrap around
  • Find What: ((.*)\R(\2\R?)+)*\K.*
  • Check regular expression and uncheck . matches newline
  • Mark All
  • Click Close
  • Search -> Bookmark -> Remove Bookmarked Lines
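
For comparison outside Notepad++, the net effect of these steps on a sorted file can be sketched in Python. This is only an illustration, not part of the Notepad++ procedure: the function name keep_duplicate_lines is hypothetical, and the sketch assumes, like the regex, that identical lines are adjacent.

```python
from itertools import groupby

def keep_duplicate_lines(text: str) -> str:
    """Keep only lines that occur more than once in a row (sorted input assumed)."""
    kept = []
    # groupby collects runs of adjacent identical lines,
    # which is why the input has to be sorted first.
    for _, run in groupby(text.splitlines()):
        run = list(run)
        if len(run) > 1:  # a run of length 1 is a unique line: drop it
            kept.extend(run)
    return "\n".join(kept)

# list1 from the question
sample = "1\n2\n3\n4\n4\n4\n5\n6\n7\n8\n9\nabc\nabd\nabd\nabe"
print(keep_duplicate_lines(sample))
```

For list1 this prints the three 4 lines followed by the two abd lines. Because splitlines() ignores a final line break, a duplicate in the last line is kept whether or not the file ends with a line break.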

Explanation

The regular expression is made up of three parts:

  • ((.*)\R(\2\R?)+)* : this is an optional block of duplicates consisting of one or more line blocks

    • the outer ( ... )* matches zero or more such blocks of duplicated lines (if, in your example, the three 4s were followed by two 5s, we would need this notion of a sequence of duplicate blocks)
    • (.*)\R(\2\R?)+: \2 references the content of (.*), so this matches all duplicates of one line
    • the second \R is an optional (due to the ?) line break. Thus it is possible to match a duplicate in the last line of the file even if that line does not end with a line break

    If there is a block of duplicated lines after the cursor position from which you start, this will match it.

  • now \K discards what we have matched so far (the duplicates) and "puts the cursor" before the first unique line

  • .* matches the next (unique) line and bookmarks it

Using Mark All, we bookmark all such unique lines so that we can remove them with the Remove Bookmarked Lines entry in the Search -> Bookmark menu.
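
The back-reference idea can also be tried outside Notepad++. Python's built-in re module supports neither \K nor \R, so the sketch below is not the expression from this answer: it inverts the approach and matches the runs of duplicates directly, assuming \n line breaks and a sorted input. The names DUPLICATE_RUN and duplicate_runs are made up for the example.

```python
import re

# ^(.*)        one whole line, captured as group 1
# (?:\n\1$)+   one or more following lines identical to group 1;
#              the $ stops "9" from matching the start of "99"
DUPLICATE_RUN = re.compile(r"^(.*)(?:\n\1$)+", re.MULTILINE)

def duplicate_runs(text: str) -> list[str]:
    """Return each block of adjacent identical lines as one string."""
    return [m.group(0) for m in DUPLICATE_RUN.finditer(text)]

sample = "3\n4\n4\n4\n5\n5\n6\n9\n9\n99\n"
for block in duplicate_runs(sample):
    print(block)
```

Each match is one block of duplicated lines, so joining the matches with line breaks gives the file with all unique lines removed. The Notepad++ expression reaches the same result from the other side: it skips these blocks with \K and marks the unique line that follows.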

Lars Fischer answered Oct 02 '22 09:10