I have two files, wordlist.txt and text.txt.
The first file, wordlist.txt, contains a huge list of words in Chinese, Japanese, and Korean, e.g.:
你
你们
我
The second file, text.txt, contains long passages, e.g.:
你们要去哪里?
卡拉OK好不好?
I want to create a new word list (wordsfound.txt), but it should only contain those lines from wordlist.txt which are found at least once within text.txt. The output file from the above should look like this:
你
你们
"我" is not found in this list because it is never found in text.txt
.
I want to find a very fast way to create this list which only contains lines from the first file that are found in the second.
I know a simple way in BASH to check each line in wordlist.txt and see if it is in text.txt using grep:
a=1
while read -r line
do
    # grep -c counts matching lines; -- and quoting keep unusual words safe
    c=$(grep -c -- "$line" text.txt)
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> wordsfound.txt
        echo "Found $a"
    else
        echo "Not found $a"
    fi
    a=$((a + 1))
done < wordlist.txt
Unfortunately, as wordlist.txt is a very long list, this process takes many hours, since every iteration spawns a separate grep process. There must be a faster solution. Here is one consideration:
As the files contain CJK characters, they can be thought of as drawing on a giant alphabet of about 8,000 letters, so nearly every word shares characters with other words. E.g.:
我
我们
Due to this fact, if "我" is never found within text.txt, then it is quite logical that "我们" never appears either. A faster script might check "我" first and, upon finding that it is not present, skip every subsequent word in wordlist.txt that also contains "我". If there are only about 8,000 unique characters in wordlist.txt, the script should not need to check so many lines.
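A rough sketch of this pruning idea in Perl might look like the following (filenames as above; the script structure is just an illustration, not something I have tested at scale). It reads text.txt once, records every character that occurs in it, rejects any word containing an absent character, and only pays for a real substring search on the survivors:
use strict;
use warnings;
use utf8::all;    # treat all file handles and output as UTF-8

# Slurp the text once and record every character that occurs in it.
open my $text_fh, '<', 'text.txt'
    or die "Cannot open 'text.txt' for reading: $!";
my $text = do { local $/; <$text_fh> };
my %char_in_text;
$char_in_text{$_} = 1 for split //, $text;

open my $wordlist_fh, '<', 'wordlist.txt'
    or die "Cannot open 'wordlist.txt' for reading: $!";
while ( my $word = <$wordlist_fh> ) {
    chomp $word;
    # Cheap pre-filter: a word containing any character that never
    # occurs in the text cannot be a substring of the text.
    next if grep { !$char_in_text{$_} } split //, $word;
    # Words that survive the pre-filter still need a real substring check.
    print "$word\n" if index( $text, $word ) >= 0;
}
Redirecting stdout (perl prune.pl > wordsfound.txt, with a hypothetical script name) would produce the desired list.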
What is the fastest way to create the list containing only those words in the first file that are also found somewhere within the second?
I grabbed the text of War and Peace from Project Gutenberg and wrote the following script. It prints all words in /usr/share/dict/words which are also in war_and_peace.txt. You can change those defaults with:
perl findwords.pl --wordlist=/path/to/wordlist --text=/path/to/text > wordsfound.txt
On my computer, it takes just over a second to run.
use strict;
use warnings;
use utf8::all;
use Getopt::Long;

my $wordlist = '/usr/share/dict/words';
my $text     = 'war_and_peace.txt';
GetOptions(
    "wordlist=s" => \$wordlist,
    "text=s"     => \$text,
);

open my $text_fh, '<', $text
    or die "Cannot open '$text' for reading: $!";
my %is_in_text;
while ( my $line = <$text_fh> ) {
    chomp($line);
    # you will want to customize this line
    my @words = grep { $_ } split /[[:punct:][:space:]]/ => $line;
    next unless @words;

    # This beasty uses the 'x' builtin in list context to assign
    # the value of 1 to all keys (the words)
    @is_in_text{@words} = (1) x @words;
}

open my $wordlist_fh, '<', $wordlist
    or die "Cannot open '$wordlist' for reading: $!";
while ( my $word = <$wordlist_fh> ) {
    chomp($word);
    if ( $is_in_text{$word} ) {
        print "$word\n";
    }
}
And here's my timing:
[ovid] $ wc -w war_and_peace.txt
565450 war_and_peace.txt
[ovid] $ time perl findwords.pl > wordsfound.txt

real    0m1.081s
user    0m1.076s
sys     0m0.000s
[ovid] $ wc -w wordsfound.txt
15277 wordsfound.txt
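One caveat: the split in the text-reading loop assumes words are delimited by spaces and punctuation, which holds for War and Peace but not for CJK passages, where a line like 你们要去哪里 would be stored as a single hash key and the lookup for 你们 would miss. A possible adaptation, sketched here rather than taken from the answer, is to slurp the whole text and test each word with Perl's index builtin:
# Sketch only: substring matching for undelimited CJK text.
# $text and $wordlist are the same variables set up by GetOptions above.
open my $text_fh, '<', $text
    or die "Cannot open '$text' for reading: $!";
my $passage = do { local $/; <$text_fh> };    # slurp the whole file

open my $wordlist_fh, '<', $wordlist
    or die "Cannot open '$wordlist' for reading: $!";
while ( my $word = <$wordlist_fh> ) {
    chomp($word);
    # index returns -1 when $word does not occur anywhere in $passage
    print "$word\n" if index( $passage, $word ) >= 0;
}
This trades the O(1) hash lookups for a linear scan per word, so it is slower than the hash version, but it matches the substring semantics the question asks for.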