I have a list of around 300k common words stored in an array, so one element of the array = one word.
On the other side, I have a huge list of strings that MAY contain one or more of these 300k words inside them. A sample string would be: ifdxawesome453.
Now, I need to check each of these long strings against the common words and return immediately when a word is found within the string. So, I need to check the 300k words against ifdxawesome453 and see if any of them is contained within it.
So what I do is:
huge_list_of_words.any? do |word|
  random_long_word.include?(word)
end
While this is okay for small samples of random long words, if I have millions of them it suddenly takes hours to finish the job.
Is there a way to do this faster? The only way I thought of is to sample out, say, the 10k most common words out of these 300k and compare against those first, and only if no match is found, compare against the full list (as sketched below).
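A minimal sketch of that two-tier idea, assuming a hypothetical most_common_10k array sampled from the full list:
# Hypothetical two-tier lookup: scan a small "hot" subset first,
# and fall back to the full 300k list only when the hot set misses.
def contains_common_word?(str, most_common_10k, huge_list_of_words)
  most_common_10k.any? { |w| str.include?(w) } ||
    huge_list_of_words.any? { |w| str.include?(w) }
end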
Another way that drastically sped things up was to group the array of 300k words by size. When I compare a long random word against it, I first check the word's size and filter out any words that are longer. I am then left with the words of equal or smaller size, and I search against those, starting from the smallest (see the sketch below).
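And a minimal sketch of that size-bucketing idea (contains_word_by_size? is an illustrative name):
# Bucket the dictionary once, up front, by word length.
words_by_size = huge_list_of_words.group_by(&:size)
sizes = words_by_size.keys.sort

def contains_word_by_size?(str, words_by_size, sizes)
  sizes.each do |size|
    break if size > str.size              # longer words can never fit inside str
    words_by_size[size].each do |word|
      return word if str.include?(word)   # stop at the first hit
    end
  end
  nil
end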
A Trie structure is a step in the right direction. A SuffixTree might also help.
It looks like the Triez gem has more features than the Trie gem, but the documentation is far from complete. :substring sounds perfect, but it seems you can only use it in change_all:
# gem install triez
require 'triez'

huge_list_of_words = Triez.new value_type: :object, default: nil

%w(awesome someword anotherword).each do |word|
  huge_list_of_words[word] = word
end

class String
  # Returns the first dictionary word contained in self, or nil.
  def contains_word_from_dict?(dict)
    dict.change_all(:substring, self) do |v|
      return v if v
    end
    nil
  end
end
'ifdxawesome45someword3'.contains_word_from_dict?(huge_list_of_words)
# => "awesome"
'ifdxawsome45someword3'.contains_word_from_dict?(huge_list_of_words)
# => "someword"
'ifdxawsome45sameword3'.contains_word_from_dict?(huge_list_of_words)
# => nil
I tried it with a bigger dictionary (~100k words) and a million lookups:
huge_list_of_words = Triez.new value_type: :object, default: nil

dict = '/usr/share/dict/american-english'
File.foreach(dict) do |word|
  word.chomp!
  huge_list_of_words[word] = word if word.size > 4 # avoid finding `if` or `word`
end

1_000_000.times do
  'ifdxawesome45someword3'.contains_word_from_dict?(huge_list_of_words)
end
It returned after 22s on my slowish laptop.
To be very honest, I don't understand how change_all works or what its purpose is; it appears to visit every stored key that occurs as a substring of the given string, which is why returning from the block bails out on the first hit. It does seem to work fine for your purpose, though! ¯\_(ツ)_/¯
A benchmark between the Trie gem and the Triez gem for this particular use case:
word count: 228982
user system total real
trie 13.410000 0.050000 13.460000 ( 13.463473)
triez 11.080000 0.010000 11.090000 ( 11.102195)
trie tail 39.920000 0.140000 40.060000 ( 40.102285)
triez tail 28.960000 0.030000 28.990000 ( 29.022630)
Generally, Triez is faster for the OP's use case.
require 'triez'
require 'trie'
require 'benchmark'

DICT = '/usr/share/dict/web2'

triez = Triez.new value_type: :object, default: nil
trie = Trie.new

count = 0
File.foreach(DICT) do |word|
  word.chomp!
  if word.size > 4
    triez[word] = word
    trie.add word
    count += 1
  end
end

puts "word count: #{count}"

# Walk the trie starting at every offset; return the first word found, or nil.
def in_trie?(str, trie)
  0.upto(str.length - 1) do |i|
    node = trie.root
    i.upto(str.length - 1) do |j|
      break unless node.walk! str[j]
      return str[i..j] if node.terminal?
    end
  end
  nil
end

def in_triez?(str, triez)
  triez.change_all(:substring, str) do |v|
    return v if v
  end
  nil
end

Benchmark.bm(12) do |b|
  b.report('trie') do
    1_000_000.times { in_trie?('ifdxawesome45someword3', trie) }
  end
  b.report('triez') do
    1_000_000.times { in_triez?('ifdxawesome45someword3', triez) }
  end
  b.report('trie tail') do
    1_000_000.times { in_trie?('ifdx45someword3awesome', trie) }
  end
  b.report('triez tail') do
    1_000_000.times { in_triez?('ifdx45someword3awesome', triez) }
  end
end
UPDATE: a benchmark for rambling-trie, where the lines with the c prefix are the compressed version. (NOTE: ROUND has been reduced to 100K rather than 1M in this benchmark.)
Word count: 228982, ROUND: 100000
user system total real
trie 1.510000 0.000000 1.510000 ( 1.511772)
triez 1.170000 0.000000 1.170000 ( 1.176075)
rambling 4.800000 0.010000 4.810000 ( 4.847021)
c rambling 25.060000 0.050000 25.110000 ( 25.172771)
trie tail 4.540000 0.010000 4.550000 ( 4.566233)
triez tail 3.080000 0.010000 3.090000 ( 3.092655)
rambling tail 4.780000 0.010000 4.790000 ( 4.803114)
c rambling tail 23.470000 0.020000 23.490000 ( 23.525066)
It seems rambling-trie is implemented in pure Ruby, and it doesn't offer direct methods for prefix matching, so the following monkey patches need to be added first. There may be a better implementation, but I didn't dig further.
class Rambling::Trie::Container
  def match_prefix?(str)
    root.match_prefix?(str.chars)
  end
end

class Rambling::Trie::RawNode
  def match_prefix?(chars, i = 0)
    if children_tree.empty?
      true
    elsif i >= chars.size
      false
    else
      letter = chars[i].to_sym
      child = children_tree[letter]
      !!child && child.match_prefix?(chars, i + 1)
    end
  end
end
class Rambling::Trie::CompressedNode
  def match_prefix?(chars)
    if children_tree.empty?
      true
    elsif chars.empty?
      false
    else
      !!(recursive_get :match_prefix?, chars)
    end
  end
end
def in_r_trie?(str, r_trie)
  0.upto(str.length - 1) do |i|
    return true if r_trie.match_prefix? str[i..-1]
  end
  false
end
The fastest way is to create a trie for all the words in the huge list. Then loop through all the long strings; for each string, start a search of the trie at every letter of the string to match any of your words. For extra speed, you can terminate early if you only need to find one contained word.
For the trie part, there seems to be a sweet implementation in Ruby here. It even has an example, very similar to your use case, that you should be able to tweak:
word = 'forestry'
node = trie.root

word.split('').each do |char|
  break unless node.walk!(char)
  if node.terminal?
    puts "Found me a word: #{node.full_state}"
  end
end
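That snippet only walks from the first character of the string. A minimal sketch of the full scan described above, restarting the walk at every offset and terminating early on the first contained word (the same idea as in_trie? in the benchmark earlier):
# Walk the trie starting at every offset of the string; return the first
# dictionary word found, or nil if the string contains none of them.
def first_contained_word(str, trie)
  (0...str.length).each do |i|
    node = trie.root
    (i...str.length).each do |j|
      break unless node.walk!(str[j])      # no dictionary word continues here
      return str[i..j] if node.terminal?   # early termination on first match
    end
  end
  nil
end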