Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove lines from a list which can be found within other longer lines in the list?

Tags:

bash

sed

perl

I have a file, list.txt, like this:

cat
bear
tree
catfish
fish
bear

I need to delete any lines which are already fully found somewhere else in the document, either as a duplicate line, or found within another longer line. For e.g., the lines "bear" and "bear" are the same, so one of these is deleted; "cat" can be found completely within "catfish", so "cat" is deleted. The output would look like this:

catfish
tree
bear

How can I delete all duplicate lines including the lines which are found within longer lines in a list?

So far, I have this:

#!/bin/bash
touch list.tmp
while read -r line
do
    found="$(grep -c $line list.tmp)"
    if [ "$found" -eq "1" ]
    then
        echo $line >> list.tmp
        echo $line" added"
    else
        echo "Not added."
fi
done < list.txt
like image 586
Village Avatar asked Jul 21 '13 23:07

Village


3 Answers

If O(N^2) doesn't bother you:

#!/usr/bin/env perl

use strict;
use warnings;
use List::MoreUtils qw{any};

my @words;
for my $word (
    sort {length $b <=> length $a}
    do {
        my %words;
        my @words = <>;
        chomp @words;
        @words{@words} = ();
        keys %words;
    }
)
{
    push @words, $word unless do {
        my $re = qr/\Q$word/;
        any {m/$re/} @words;
    };
}

print "$_\n" for @words;

If you would like O(NlogN) you have to use some sort of trie approach. For example using suffix tree:

#!/usr/bin/env perl

use strict;
use warnings;
use Tree::Suffix;

my $tree = Tree::Suffix->new();

my @words;
for my $word (
    sort {length $b <=> length $a}
    do {
        my %words;
        my @words = <>;
        chomp @words;
        @words{@words} = ();
        keys %words;
    }
)
{
    unless ($tree->find($word)){
        push @words, $word;
        $tree->insert($word);
    };
}

print "$_\n" for @words;
like image 115
Hynek -Pichi- Vychodil Avatar answered Oct 05 '22 23:10

Hynek -Pichi- Vychodil


I can think of one fairly good algorithm. I will answer in Perl to keep the result sufficiently efficient.

For each word, test if it is a substring of any word in the set of larger words. If not, remove all those words from the set that are substrings of this word, and add the word to the set.

Because this generally implies looping through all values, we might as well use an array. To speed things up, we keep the array sorted in decreasing size. This allows us to test against each word already in the set exactly once.

use strict; use warnings;

my @words;
INPUT:
while (<>) {
  chomp;
  my $len = length;
  my $i = 0;

  # check larger words if they contain $_
  LARGER:
  for ( ; $i < @words ; $i++) {
    last LARGER if length $words[$i] < $len;
    next INPUT if 0 <= index $words[$i], $_; # the word was seen
  }

  # insert the new word
  splice @words, $i++, 0, $_;

  # remove words that are contained in new word
  for ( ; $i < @words ; $i++) {
    splice @words, $i--, 1 if 0 <= index $_, $words[$i]; # $i-- adjusts index for deletion
  }
}
print "$_\n" for @words;

The 0 <= index $a, $b is an efficient way to write $a =~ /\Q$b\E/.

This is a generalisation of David W.'s algorithm. If the input is sorted in decreasing wordlength, then both implementations produce the same output.


If the words are rather short, but there are many different words, it may be preferable to remember all possible substrings. This allows us to quickly detect a word as seen, but makes it expensive to add a word to the known list.

my %seen;  # used to detect seen words
my %words; # used to remember real words
while (<>) {
  chomp;
  next if exists $seen{$_};
  # so we didn't see it. Let's produce all substrings
  START: for (my $start = 0 ; $start < length() - 1 ; $start++) {
    LENGTH: for (my $length = length() - $start ; $length ; $length--) {
      my $substr = substr $_, $start, $length;
      delete $words{$substr};         # if this was a real word, it's now a substring
      last LENGTH if exists $seen{$substr};  # dont repeat yourself
      $seen{$substr} = undef;         # add the entry
    }
  }
  $words{$_} = undef;  # remember this word as a real word
}
undef %seen;  # free obscene amount of memory
print "$_\n" for keys %words;
like image 23
amon Avatar answered Oct 06 '22 01:10

amon


This requires two passes to the file but should work:

Content of script.awk

NR==FNR {
  words[$1]++
  next
} 
{
  for (word in words) { 
    if (index ($1,word) == 0) { 
      words[word] 
    } 
    else { 
      delete words[word]
      words[$1] 
    } 
  }
}
END {
  for (left in words)
    print left
}

Test:

$ cat file
cat
bear
tree
catfish
fish
bear
$ awk -f script.awk file file
bear
catfish
tree
like image 30
jaypal singh Avatar answered Oct 06 '22 00:10

jaypal singh