I have a file, list.txt, like this:
cat
bear
tree
catfish
fish
bear
I need to delete any line that is already fully found somewhere else in the document, either as a duplicate line or within another, longer line. For example, the lines "bear" and "bear" are the same, so one of them is deleted; "cat" can be found completely within "catfish", so "cat" is deleted. The output would look like this:
catfish
tree
bear
How can I delete all duplicate lines, including lines that are found within longer lines, from a list?
So far, I have this:
#!/bin/bash
touch list.tmp
while read -r line
do
    found="$(grep -c "$line" list.tmp)"
    if [ "$found" -eq "1" ]
    then
        echo "$line" >> list.tmp
        echo "$line added"
    else
        echo "Not added."
    fi
done < list.txt
If O(N^2) doesn't bother you:
#!/usr/bin/env perl

use strict;
use warnings;

use List::MoreUtils qw{any};

my @words;
for my $word (
    # dedupe the input lines and process them longest first
    sort { length $b <=> length $a }
    do {
        my %words;
        my @words = <>;
        chomp @words;
        @words{@words} = ();
        keys %words;
    }
) {
    # keep $word only if it does not occur inside a word we already kept
    push @words, $word unless do {
        my $re = qr/\Q$word/;
        any { m/$re/ } @words;
    };
}

print "$_\n" for @words;
If you would like O(N log N), you have to use some sort of trie approach, for example a suffix tree:
#!/usr/bin/env perl

use strict;
use warnings;

use Tree::Suffix;

my $tree = Tree::Suffix->new();
my @words;
for my $word (
    # dedupe the input lines and process them longest first
    sort { length $b <=> length $a }
    do {
        my %words;
        my @words = <>;
        chomp @words;
        @words{@words} = ();
        keys %words;
    }
) {
    # keep $word only if it is not a substring of a word already in the tree
    unless ($tree->find($word)) {
        push @words, $word;
        $tree->insert($word);
    }
}

print "$_\n" for @words;
I can think of one fairly good algorithm. I will answer in Perl to keep the result sufficiently efficient.
For each word, test if it is a substring of any word in the set of larger words. If not, remove all those words from the set that are substrings of this word, and add the word to the set.
Because this generally implies looping through all values, we might as well use an array. To speed things up, we keep the array sorted in decreasing size. This allows us to test against each word already in the set exactly once.
use strict;
use warnings;

my @words;

INPUT:
while (<>) {
    chomp;
    my $len = length;
    my $i   = 0;

    # check whether one of the larger words already contains $_
    LARGER:
    for ( ; $i < @words ; $i++) {
        last LARGER if length $words[$i] < $len;
        next INPUT if 0 <= index $words[$i], $_;    # the word was seen
    }

    # insert the new word, keeping the array sorted by decreasing length
    splice @words, $i++, 0, $_;

    # remove words that are contained in the new word
    for ( ; $i < @words ; $i++) {
        splice @words, $i--, 1 if 0 <= index $_, $words[$i];    # $i-- adjusts the index for the deletion
    }
}

print "$_\n" for @words;
The 0 <= index $a, $b is an efficient way to write $a =~ /\Q$b\E/.
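For instance, here is a minimal sketch with two made-up strings ($haystack and $needle are hypothetical names used only for this illustration) showing that both checks agree:

#!/usr/bin/env perl
use strict;
use warnings;

# Both strings are made up for illustration; any pair of strings would do.
my ($haystack, $needle) = ('catfish', 'cat');

# index() returns the position of $needle inside $haystack, or -1 when it is absent,
# so "0 <= index(...)" is true exactly when $needle occurs somewhere in $haystack.
my $by_index = (0 <= index($haystack, $needle)) ? 'found' : 'not found';

# The regex form needs \Q...\E so any metacharacters in $needle are matched literally.
my $by_regex = ($haystack =~ /\Q$needle\E/) ? 'found' : 'not found';

print "index: $by_index\n";    # prints "index: found"
print "regex: $by_regex\n";    # prints "regex: found"

The index form also avoids compiling and running a regular expression for every comparison.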
This is a generalisation of David W.'s algorithm. If the input is sorted by decreasing word length, then both implementations produce the same output.
If the words are rather short, but there are many different words, it may be preferable to remember all possible substrings. This allows us to quickly detect a word as seen, but makes it expensive to add a word to the known list.
use strict;
use warnings;

my %seen;  # used to detect seen words and substrings
my %words; # used to remember real words

while (<>) {
    chomp;
    next if exists $seen{$_};
    # we have not seen it, so produce all of its substrings
    START:
    for (my $start = 0 ; $start < length() ; $start++) {
        LENGTH:
        for (my $length = length() - $start ; $length ; $length--) {
            my $substr = substr $_, $start, $length;
            delete $words{$substr};               # if this was a real word, it is now just a substring
            last LENGTH if exists $seen{$substr}; # don't repeat yourself
            $seen{$substr} = undef;               # add the entry
        }
    }
    $words{$_} = undef; # remember this word as a real word
}

undef %seen; # free an obscene amount of memory

print "$_\n" for keys %words;
This requires two passes over the file, but it should work:
# first pass: remember every line as a candidate word
NR == FNR {
    words[$1]++
    next
}

# second pass: compare each line against the candidates
{
    for (word in words) {
        if (index($1, word) == 0) {
            # word does not occur in this line: leave it alone
            words[word]
        }
        else {
            # word occurs in this line: drop it and keep the containing line instead
            delete words[word]
            words[$1]
        }
    }
}

END {
    for (left in words)
        print left
}
$ cat file
cat
bear
tree
catfish
fish
bear
$ awk -f script.awk file file
bear
catfish
tree