
What is the fastest way to delete lines in a file which have no match in a second file?

I have two files, wordlist.txt and text.txt.

The first file, wordlist.txt, contains a huge list of words in Chinese, Japanese, and Korean, e.g.:

你
你们
我

The second file, text.txt, contains long passages, e.g.:

你们要去哪里?
卡拉OK好不好?

I want to create a new word list (wordsfound.txt), but it should only contain those lines from wordlist.txt which are found at least once within text.txt. Given the files above, the output should be:

你
你们

"我" is not found in this list because it is never found in text.txt.

I want to find a very fast way to create this list which only contains lines from the first file that are found in the second.

I know a simple way in Bash to check each line in wordlist.txt and see if it is in text.txt using grep:

a=1
while read -r line
do
    c=$(grep -c "$line" text.txt)
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> wordsfound.txt
        echo "Found $a"
    else
        echo "Not found $a"
    fi
    a=$((a + 1))
done < wordlist.txt

Unfortunately, as wordlist.txt is a very long list, this process takes many hours. There must be a faster solution. Here is one consideration:

As the files contain CJK characters, they can be thought of as drawing on a giant alphabet of about 8,000 letters, so nearly every word shares characters with other words. E.g.:

我
我们

Due to this fact, if "我" is never found within text.txt, then it is quite logical that "我们" never appears either. A faster script might perhaps check "我" first, and upon finding that it is not present, would avoid checking every subsequent word in wordlist.txt that also contains "我". If there are about 8,000 unique characters found in wordlist.txt, then the script should not need to check so many lines.
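This pruning idea can be sketched in awk (not code from the question; a hypothetical illustration). Before running the full substring test for a word, it checks that every character of the word occurs somewhere in the text; if one is missing, the word is skipped immediately. Note that `split($0, chars, "")` splits into characters in GNU awk under a UTF-8 locale, and into bytes in some other awks, but a byte-level check is still a valid (if weaker) pre-filter:

```shell
# the example files from the question, created here so the sketch is self-contained
printf '我\n我们\n你\n你们\n' > wordlist.txt
printf '你们要去哪里?\n'      > text.txt

awk 'NR==FNR { text = text $0 "\n"; next }   # first pass: load the whole text
     {
       n = split($0, chars, "")              # break the word into characters
       for (i = 1; i <= n; i++)
           if (index(text, chars[i]) == 0)
               next                          # a character is missing: the word cannot occur
       if (index(text, $0)) print            # full substring test only for survivors
     }' text.txt wordlist.txt > pruned_found.txt

cat pruned_found.txt
```

Here "我" and "我们" are both rejected by the cheap character check, without a full substring scan for "我们".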

What is the fastest way to create the list containing only those words from the first file that are also found somewhere within the second?
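For comparison with the loop above: the whole check can also be done in a single awk pass that reads text.txt into one string and then tests each wordlist line with `index()`, avoiding one grep process per word (a sketch, not from the question; the file names are the question's):

```shell
# the example files from the question, created here so the sketch is self-contained
printf '你\n你们\n我\n'                > wordlist.txt
printf '你们要去哪里?\n卡拉OK好不好?\n' > text.txt

# first file: concatenate the text into one string (newline-joined, so no
# false matches are created across line boundaries)
# second file: print each word that occurs as a substring of the text
awk 'NR==FNR { text = text $0 "\n"; next }
     index(text, $0) { print }' text.txt wordlist.txt > wordsfound.txt

cat wordsfound.txt
```

This loads the text once instead of re-scanning it for every word, which is where the original loop spends its hours.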

Village asked Mar 20 '12

1 Answer

I grabbed the text of War and Peace from Project Gutenberg and wrote the following script. It prints all words in /usr/share/dict/words which are also in war_and_peace.txt. You can change the defaults with:

perl findwords.pl --wordlist=/path/to/wordlist --text=/path/to/text > wordsfound.txt

On my computer, it takes just over a second to run.

use strict;
use warnings;
use utf8::all;

use Getopt::Long;

my $wordlist = '/usr/share/dict/words';
my $text     = 'war_and_peace.txt';

GetOptions(
    "worlist=s" => \$wordlist,
    "text=s"    => \$text,
);

open my $text_fh, '<', $text
    or die "Cannot open '$text' for reading: $!";

my %is_in_text;
while ( my $line = <$text_fh> ) {
    chomp($line);

    # you will want to customize this line
    my @words = grep { $_ } split /[[:punct:][:space:]]/ => $line;
    next unless @words;

    # This beasty uses the 'x' builtin in list context to assign
    # the value of 1 to all keys (the words)
    @is_in_text{@words} = (1) x @words;
}

open my $wordlist_fh, '<', $wordlist
    or die "Cannot open '$wordlist' for reading: $!";

while ( my $word = <$wordlist_fh> ) {
    chomp($word);
    if ( $is_in_text{$word} ) {
        print "$word\n";
    }
}

And here's my timing:

[ovid] $ wc -w war_and_peace.txt 
565450 war_and_peace.txt
[ovid] $ time perl findwords.pl > wordsfound.txt 

real    0m1.081s
user    0m1.076s
sys     0m0.000s
[ovid] $ wc -w wordsfound.txt 
15277 wordsfound.txt
Ovid answered Oct 19 '22