
Remove duplicate lines only if the duplicate items are within 5 lines of each other

Tags:

awk

I want to delete the duplicate lines of a text file only if the duplicate items are within 5 lines of each other.

For example :

Chapter 1.1
Overview
Figure 1
Figure 2
Overview <- This should be deleted (ie. within 5 lines of the previous instance) 
Figure 3
Figure 4
...

(many lines in between)

Chapter 1.2
Overview <- This should not be deleted (ie. not within 5 lines of the previous instance)

I tried awk '!a[$0]++', but that deletes all duplicate lines in the entire file. I also tried a loop with sed -n "$startpoint,$endpoint p" file.txt | awk '!a[$0]++', but that actually creates new duplicates...

What other approaches can I try to remove duplicate lines that are within 5 lines of each other?

asked Jun 16 '21 by Pierre

People also ask

How do I remove duplicate lines in files?

Remove duplicate lines with uniq: if you don't need to preserve the order of the lines in the file, the sort and uniq commands will do what you need in a very straightforward way. The sort command sorts the lines in alphanumeric order, and the uniq command reduces sequential identical lines to one.
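As a quick illustration (the file name sample.txt is made up), the sort/uniq pipeline collapses duplicates anywhere in the file, which is exactly why it cannot solve the 5-line-window problem in this question:

```shell
# Hypothetical sample with a repeat 2 lines apart
printf 'Overview\nFigure 1\nOverview\n' > sample.txt

# sort groups identical lines together; uniq then drops the adjacent repeats.
# Note that the original line order is lost.
sort sample.txt | uniq
# prints:
# Figure 1
# Overview

# sort -u sample.txt is equivalent
```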


2 Answers

You may use this shorter awk command:

awk '!NF || NR > rec[$0]; {rec[$0] = NR+5}' file

Chapter 1.1
Overview
Figure 1
Figure 2
Figure 3
Figure 4
...

(many lines in between)

Chapter 1.2
Figure 1
Figure 2
Overview

Algorithm Details:

  • !NF || NR > rec[$0];: Print the record if the current line is empty OR if the current record number is greater than the value stored in array rec for this record. When $0 doesn't exist in rec yet, rec[$0] is 0, so the line is also printed. A line is suppressed only when it reappears within 5 lines of the value stored in rec.
  • {rec[$0] = NR+5}: Save each record in array rec with the current line number + 5 as its value.
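A quick way to verify the behaviour is to feed the command a trimmed version of the question's sample via a here-document:

```shell
awk '!NF || NR > rec[$0]; {rec[$0] = NR+5}' <<'EOF'
Overview
Figure 1
Figure 2
Overview
EOF
# prints:
# Overview
# Figure 1
# Figure 2
```

The second "Overview" on line 4 falls within 5 lines of line 1 (rec["Overview"] was set to 1+5 = 6, and 4 > 6 is false), so it is skipped.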
answered May 07 '23 by anubhava


1st solution: A single-pass solution over Input_file.

awk '
{
  arr[FNR]=$0
}
END{
  for(i=1;i<=FNR;i++){
    count=0
    for(j=i;j>=(i-5);j--){
      if(arr[i]!=arr[j]){ count++      }
    }
    if(count==5)        { print arr[i] }
  }
}
'  Input_file


2nd solution: With your shown samples and two passes of Input_file, one could also try the following. Fair warning: it could be slow if the dataset is huge.

awk '
FNR==NR{
  arr[FNR]=$0
  next
}
{
  count=0
  for(i=FNR;i>=(FNR-5);i--){
    if($0!=arr[i]){ count++ }
  }
  if(count==5)    { print   }
}
' Input_file Input_file

Explanation: Adding detailed explanation for above.

awk '                            ##Starting awk program from here.
FNR==NR{                         ##Checking condition which will be true 1st time Input_file is being read.
  arr[FNR]=$0                    ##Creating arr with index of current line number and value is current line.
  next                           ##next will skip all further statements from here.
}
{
  count=0                        ##Resetting count to 0 here.
  for(i=FNR;i>=(FNR-5);i--){     ##Looping over the current line and the previous 5 lines.
    if($0!=arr[i]){ count++ }    ##If the current line differs from the stored line, increase count by 1.
  }
  if(count==5)    { print   }    ##If none of the previous 5 lines matched (count is 5), print the line.
}
' Input_file Input_file          ##Mentioning Input_file names here.
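Run against the question's sample (recreated here; the file name Input_file comes from the answer), the two-pass version keeps everything except the nearby repeat:

```shell
# Recreate the question's sample
printf 'Overview\nFigure 1\nFigure 2\nOverview\nFigure 3\nFigure 4\n' > Input_file

# First pass stores all lines; second pass prints a line only when
# none of its previous 5 lines match it.
awk '
FNR==NR{ arr[FNR]=$0; next }
{
  count=0
  for(i=FNR;i>=(FNR-5);i--){ if($0!=arr[i]){ count++ } }
  if(count==5){ print }
}
' Input_file Input_file
# prints:
# Overview
# Figure 1
# Figure 2
# Figure 3
# Figure 4
```

Only the second "Overview" (line 4) is dropped, since line 1 matches within its 5-line lookback.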


3rd solution:

awk '!arr[$0]++;++count==5{delete arr;count=0}' Input_file

NOTE: The 1st and 2nd solutions assume one wants to compare each line with its surrounding 5 lines in a sliding window (eg: 1-6, 2-7 and so on), whereas the 3rd solution assumes one wants to remove duplicates within each fixed block of 5 lines (eg: 1-5, 6-10 and so on).
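The block-based semantics of the 3rd solution can be seen with a small made-up input: the array of seen lines is cleared after every 5 input lines, so a repeat that lands in a later block survives:

```shell
printf 'a\nb\na\nc\nd\na\n' | awk '!arr[$0]++;++count==5{delete arr;count=0}'
# prints:
# a
# b
# c
# d
# a
```

The "a" on line 3 is inside the first 5-line block and is dropped; the "a" on line 6 begins a new block (arr was deleted after line 5), so it is printed again.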

answered May 07 '23 by RavinderSingh13