I want to delete the duplicate lines of a text file only if the duplicate items are within 5 lines of each other.
For example:
Chapter 1.1
Overview
Figure 1
Figure 2
Overview <- This should be deleted (i.e. within 5 lines of the previous instance)
Figure 3
Figure 4
...
(many lines in between)
Chapter 1.2
Overview <- This should not be deleted (i.e. not within 5 lines of the previous instance)
I tried to use awk '!a[$0]++'
but this deletes all the duplicate lines in the entire file. I also tried a loop with sed -n "$startpoint,$endpoint p" file.txt | awk '!a[$0]++'
but this actually creates new duplicates...
What other approaches can I try to remove the duplicate lines that are within 5 lines of each other?
You may use this shorter awk command:
awk '!NF || NR > rec[$0]; {rec[$0] = NR+5}' file
Chapter 1.1
Overview
Figure 1
Figure 2
Figure 3
Figure 4
...
(many lines in between)
Chapter 1.2
Figure 1
Figure 2
Overview
Algorithm Details:
!NF || NR > rec[$0]; : Print the record if the current line is empty OR if the current record number is greater than the value stored in array rec for the current record. When $0 doesn't yet exist in rec, the missing entry compares as 0, so the line is printed as well. The line is suppressed only when we are within 5 lines of the value stored in rec.
{rec[$0] = NR+5} : Save each record in array rec, with the current line number + 5 as its value.
1st solution: a single-pass solution over Input_file.
awk '
{
arr[FNR]=$0
}
END{
for(i=1;i<=FNR;i++){
count=0
for(j=i;j>=(i-5);j--){
if(arr[i]!=arr[j]){ count++ }
}
if(count==5) { print arr[i] }
}
}
' Input_file
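As a quick check, the solution above can be run on sample data shaped like the question's (the program is condensed to one line here, with count shortened to c; the file contents are an assumption, not from the original post):

```shell
# Hypothetical sample data matching the question's layout.
printf '%s\n' 'Chapter 1.1' 'Overview' 'Figure 1' 'Figure 2' \
  'Overview' 'Figure 3' 'Figure 4' 'Figure 5' 'Figure 6' \
  'Chapter 1.2' 'Overview' > Input_file

# Same logic as above, condensed: a line prints only when all 5
# preceding lines differ from it (j==i always matches itself, so
# c==5 means 5 mismatches among the previous 5 lines).
awk '{arr[FNR]=$0} END{for(i=1;i<=FNR;i++){c=0;for(j=i;j>=i-5;j--)if(arr[i]!=arr[j])c++;if(c==5)print arr[i]}}' Input_file
```

Out-of-range indices (j < 1) read as empty strings, which count as mismatches for non-empty lines, so the first few lines of the file are handled correctly.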
2nd solution: with your shown samples, one could also try the following two-pass approach over Input_file. Fair warning: it could be slow if the dataset is huge.
awk '
FNR==NR{
arr[FNR]=$0
next
}
{
count=0
for(i=FNR;i>=(FNR-5);i--){
if($0!=arr[i]){ count++ }
}
if(count==5) { print }
}
' Input_file Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##This condition is true only during the 1st pass of Input_file.
arr[FNR]=$0 ##Creating arr with the current line number as index and the current line as value.
next ##next will skip all further statements from here.
}
{
count=0 ##Reset count here.
for(i=FNR;i>=(FNR-5);i--){ ##Looping over the current line and the 5 stored lines before it.
if($0!=arr[i]){ count++ } ##If the current line differs from the stored line, increment count.
}
if(count==5) { print } ##If count is 5, none of the previous 5 lines matched, so print the line.
}
' Input_file Input_file ##Mentioning Input_file names here.
3rd solution:
awk '!arr[$0]++;++count==5{delete arr;count=0}' Input_file
NOTE: The 1st and 2nd solutions assume one wants to compare each line with its next 5 lines (e.g. 1-6, 2-7, and so on), whereas the 3rd solution removes duplicates within each fixed block of 5 lines (e.g. 1-5, 6-10, and so on).
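The difference is easiest to see on a tiny made-up input where a duplicate straddles a block boundary (data.txt is an assumed filename): the second B on line 6 is within 5 lines of the first B on line 2, but it lands in a new block of 5.

```shell
# Hypothetical input: the duplicate B falls just past a block boundary.
printf '%s\n' A B C D E B > data.txt

# 3rd solution: dedupe within fixed blocks of 5 lines. The block ends
# after E, arr is cleared, so the second B starts fresh and survives.
awk '!arr[$0]++;++count==5{delete arr;count=0}' data.txt

# Sliding-window one-liner from the earlier answer: the second B is
# within 5 lines of the first, so it is suppressed.
awk '!NF || NR > rec[$0]; {rec[$0] = NR+5}' data.txt
```

The first command prints all six lines including the trailing B; the second drops it.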