
Remove duplicate lines only if the duplicate items are within 5 lines of each other

Tags:

awk

I want to delete the duplicate lines of a text file only if the duplicate items are within 5 lines of each other.

For example :

Chapter 1.1
Overview
Figure 1
Figure 2
Overview <- This should be deleted (ie. within 5 lines of the previous instance) 
Figure 3
Figure 4
...

(many lines in between)

Chapter 1.2
Overview <- This should not be deleted (ie. not within 5 lines of the previous instance)

I tried awk '!a[$0]++', but that deletes all duplicate lines in the entire file. I also tried a loop with sed -n "$startpoint,$endpoint p" file.txt | awk '!a[$0]++', but that actually creates new duplicates...

What other approaches can I try to remove duplicate lines that are within 5 lines of each other?

asked Jun 16 '21 by Pierre

People also ask

How do I remove duplicate lines in files?

Remove duplicate lines with uniq: if you don't need to preserve the order of the lines in the file, the sort and uniq commands will do what you need in a very straightforward way. The sort command sorts the lines in alphanumeric order, and the uniq command reduces sequential identical lines to one.
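As a quick illustration (the file name sample.txt is made up), the sort/uniq pipeline collapses duplicates anywhere in the file, which is exactly why it cannot solve the 5-line-window problem in this question:

```shell
# Hypothetical sample with a repeat 2 lines apart
printf 'Overview\nFigure 1\nOverview\n' > sample.txt

# sort groups identical lines together; uniq then drops the adjacent repeats.
# Note that the original line order is lost.
sort sample.txt | uniq
# prints:
# Figure 1
# Overview

# sort -u sample.txt is equivalent
```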


2 Answers

You may use this shorter awk command:

awk '!NF || NR > rec[$0]; {rec[$0] = NR+5}' file

Chapter 1.1
Overview
Figure 1
Figure 2
Figure 3
Figure 4
...

(many lines in between)

Chapter 1.2
Figure 1
Figure 2
Overview

Algorithm Details:

  • !NF || NR > rec[$0];: Print the record if the current line is empty OR if the current record number is greater than the value stored in array rec for this record. When $0 doesn't exist in rec yet, rec[$0] is 0, so the line is also printed. A line is suppressed only when it reappears within 5 lines of the value stored in rec.
  • {rec[$0] = NR+5}: Save each record in array rec with the current line number + 5 as its value.
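A quick way to verify the behaviour is to feed the command a trimmed version of the question's sample via a here-document:

```shell
awk '!NF || NR > rec[$0]; {rec[$0] = NR+5}' <<'EOF'
Overview
Figure 1
Figure 2
Overview
EOF
# prints:
# Overview
# Figure 1
# Figure 2
```

The second "Overview" on line 4 falls within 5 lines of line 1 (rec["Overview"] was set to 1+5 = 6, and 4 > 6 is false), so it is skipped.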
answered May 07 '23 by anubhava


1st solution: A single-pass solution over Input_file.

awk '
{
  arr[FNR]=$0
}
END{
  for(i=1;i<=FNR;i++){
    count=0
    for(j=i;j>=(i-5);j--){
      if(arr[i]!=arr[j]){ count++      }
    }
    if(count==5)        { print arr[i] }
  }
}
'  Input_file


2nd solution: With your shown samples and two passes of Input_file, one could also try the following. Fair warning: it could be slow if the dataset is huge.

awk '
FNR==NR{
  arr[FNR]=$0
  next
}
{
  count=0
  for(i=FNR;i>=(FNR-5);i--){
    if($0!=arr[i]){ count++ }
  }
  if(count==5)    { print   }
}
' Input_file Input_file

Explanation: Adding detailed explanation for above.

awk '                            ##Starting awk program from here.
FNR==NR{                         ##Checking condition which will be true 1st time Input_file is being read.
  arr[FNR]=$0                    ##Creating arr with index of current line number and value is current line.
  next                           ##next will skip all further statements from here.
}
{
  count=0                        ##Resetting count to 0 here.
  for(i=FNR;i>=(FNR-5);i--){     ##Looping over the current line and the previous 5 lines.
    if($0!=arr[i]){ count++ }    ##If the current line differs from the stored line, increase count by 1.
  }
  if(count==5)    { print   }    ##If none of the previous 5 lines matched (count is 5), print the line.
}
' Input_file Input_file          ##Mentioning Input_file names here.
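Run against the question's sample (recreated here; the file name Input_file comes from the answer), the two-pass version keeps everything except the nearby repeat:

```shell
# Recreate the question's sample
printf 'Overview\nFigure 1\nFigure 2\nOverview\nFigure 3\nFigure 4\n' > Input_file

# First pass stores all lines; second pass prints a line only when
# none of its previous 5 lines match it.
awk '
FNR==NR{ arr[FNR]=$0; next }
{
  count=0
  for(i=FNR;i>=(FNR-5);i--){ if($0!=arr[i]){ count++ } }
  if(count==5){ print }
}
' Input_file Input_file
# prints:
# Overview
# Figure 1
# Figure 2
# Figure 3
# Figure 4
```

Only the second "Overview" (line 4) is dropped, since line 1 matches within its 5-line lookback.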


3rd solution:

awk '!arr[$0]++;++count==5{delete arr;count=0}' Input_file

NOTE: The 1st and 2nd solutions assume one wants to compare each line with its surrounding 5 lines in a sliding window (eg: 1-6, 2-7 and so on), whereas the 3rd solution assumes one wants to remove duplicates within each fixed block of 5 lines (eg: 1-5, 6-10 and so on).
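The block-based semantics of the 3rd solution can be seen with a small made-up input: the array of seen lines is cleared after every 5 input lines, so a repeat that lands in a later block survives:

```shell
printf 'a\nb\na\nc\nd\na\n' | awk '!arr[$0]++;++count==5{delete arr;count=0}'
# prints:
# a
# b
# c
# d
# a
```

The "a" on line 3 is inside the first 5-line block and is dropped; the "a" on line 6 begins a new block (arr was deleted after line 5), so it is printed again.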

answered May 07 '23 by RavinderSingh13