I have two files, wordlist.txt and text.txt.
The first file, wordlist.txt, contains a huge list of words in Chinese, Japanese, and Korean, e.g.:
你
你们
我
The second file, text.txt, contains long passages, e.g.:
你们要去哪里?
卡拉OK好不好?
I want to create a new word list (wordsfound.txt), but it should only contain those lines from wordlist.txt which are found at least once within text.txt. The output file from the above should look like this:
你
你们
"我" is not found in this list because it is never found in text.txt
.
I want to find a very fast way to create this list which only contains lines from the first file that are found in the second.
I know a simple way in BASH to check each line in wordlist.txt and see if it is in text.txt using grep:
a=1
while read -r line
do
    # grep -c counts matching lines; -- and quoting keep unusual words safe
    c=$(grep -c -- "$line" text.txt)
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> wordsfound.txt
        echo "Found $a"
    else
        echo "Not found $a"
    fi
    a=$((a + 1))
done < wordlist.txt
Unfortunately, as wordlist.txt is a very long list, this process takes many hours, since every iteration spawns a separate grep process. There must be a faster solution. Here is one consideration:
As the files contain CJK characters, they can be thought of as drawing on a giant alphabet of about 8,000 letters, so nearly every word shares characters with other words. E.g.:
我
我们
Due to this fact, if "我" is never found within text.txt, then it is quite logical that "我们" never appears either. A faster script might check "我" first and, upon finding that it is not present, skip every subsequent word in wordlist.txt that also contains "我". If there are only about 8,000 unique characters in wordlist.txt, the script should not need to check so many lines.
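A rough sketch of this pruning idea in Perl might look like the following (filenames as above; the script structure is just an illustration, not something I have tested at scale). It reads text.txt once, records every character that occurs in it, rejects any word containing an absent character, and only pays for a real substring search on the survivors:
use strict;
use warnings;
use utf8::all;    # treat all file handles and output as UTF-8

# Slurp the text once and record every character that occurs in it.
open my $text_fh, '<', 'text.txt'
    or die "Cannot open 'text.txt' for reading: $!";
my $text = do { local $/; <$text_fh> };
my %char_in_text;
$char_in_text{$_} = 1 for split //, $text;

open my $wordlist_fh, '<', 'wordlist.txt'
    or die "Cannot open 'wordlist.txt' for reading: $!";
while ( my $word = <$wordlist_fh> ) {
    chomp $word;
    # Cheap pre-filter: a word containing any character that never
    # occurs in the text cannot be a substring of the text.
    next if grep { !$char_in_text{$_} } split //, $word;
    # Words that survive the pre-filter still need a real substring check.
    print "$word\n" if index( $text, $word ) >= 0;
}
Redirecting stdout (perl prune.pl > wordsfound.txt, with a hypothetical script name) would produce the desired list.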
What is the fastest way to create the list containing only those words in the first file that are also found somewhere within the second?
I grabbed the text of War and Peace from Project Gutenberg and wrote the following script. It prints all words in /usr/share/dict/words which are also in war_and_peace.txt. You can change those defaults with:
perl findwords.pl --wordlist=/path/to/wordlist --text=/path/to/text > wordsfound.txt
On my computer, it takes just over a second to run.
use strict;
use warnings;
use utf8::all;
use Getopt::Long;

my $wordlist = '/usr/share/dict/words';
my $text     = 'war_and_peace.txt';
GetOptions(
    "wordlist=s" => \$wordlist,
    "text=s"     => \$text,
);

open my $text_fh, '<', $text
    or die "Cannot open '$text' for reading: $!";
my %is_in_text;
while ( my $line = <$text_fh> ) {
    chomp($line);
    # you will want to customize this line
    my @words = grep { $_ } split /[[:punct:][:space:]]/ => $line;
    next unless @words;

    # This beasty uses the 'x' builtin in list context to assign
    # the value of 1 to all keys (the words)
    @is_in_text{@words} = (1) x @words;
}

open my $wordlist_fh, '<', $wordlist
    or die "Cannot open '$wordlist' for reading: $!";
while ( my $word = <$wordlist_fh> ) {
    chomp($word);
    if ( $is_in_text{$word} ) {
        print "$word\n";
    }
}
And here's my timing:
[ovid] $ wc -w war_and_peace.txt
565450 war_and_peace.txt
[ovid] $ time perl findwords.pl > wordsfound.txt

real    0m1.081s
user    0m1.076s
sys     0m0.000s
[ovid] $ wc -w wordsfound.txt
15277 wordsfound.txt
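One caveat: the split in the text-reading loop assumes words are delimited by spaces and punctuation, which holds for War and Peace but not for CJK passages, where a line like 你们要去哪里 would be stored as a single hash key and the lookup for 你们 would miss. A possible adaptation, sketched here rather than taken from the answer, is to slurp the whole text and test each word with Perl's index builtin:
# Sketch only: substring matching for undelimited CJK text.
# $text and $wordlist are the same variables set up by GetOptions above.
open my $text_fh, '<', $text
    or die "Cannot open '$text' for reading: $!";
my $passage = do { local $/; <$text_fh> };    # slurp the whole file

open my $wordlist_fh, '<', $wordlist
    or die "Cannot open '$wordlist' for reading: $!";
while ( my $word = <$wordlist_fh> ) {
    chomp($word);
    # index returns -1 when $word does not occur anywhere in $passage
    print "$word\n" if index( $passage, $word ) >= 0;
}
This trades the O(1) hash lookups for a linear scan per word, so it is slower than the hash version, but it matches the substring semantics the question asks for.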