Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster way to see if a huge list of strings is contained within another string

I have a list of around 300k common words stored in an array. So, 1 element of the array = 1 word.

On the other side, I have a huge list of strings that MAY contain one or more of these 300k words inside them. A sample string would be: ifdxawesome453.

Now, I need to check each of these long strings against the common words. If a word is found within that string, immediately return. So, I need to check the 300k words again ifdxawesome453 and see if any of them is contained within it.

So what I do is:

huge_list_of_words.any? do |word|
  random_long_word.include?(word)
end

While this is okay for small samples of random long words, if I have millions of them, suddenly it takes hours to finish the job.

Is there a way to do this faster? The only way I thought of is if I sample out, say 10k most common words out of these 300k and compare against it first, and if no match is found, compare against the full list.

Another way that drastically sped things up was to group the array of the 300k words by size. When I then compare the long random word against it, I first check if the word's size and filter out any longer words. I am then left with the indexes of the equal size or less words, and search against them, starting from the word with the smallest size.

like image 599
anemaria20 Avatar asked Jan 07 '17 12:01

anemaria20


3 Answers

Solution

Trie structure is a step in the right direction. SuffixTree might also help.

It looks like Triez gem has more features than Trie gem, but the documentation is far from being complete. :substring sounds perfect, but it seems you can only use it in change_all :

# gem install triez
require 'triez'

huge_list_of_words = Triez.new value_type: :object, default: nil

%w(awesome someword anotherword).each do |word|
  huge_list_of_words[word] = word
end

class String
  def contains_word_from_dict?(dict)
    dict.change_all(:substring, self) do |v|
      return v if v
    end
    nil
  end
end

'ifdxawesome45someword3'.contains_word_from_dict?(huge_list_of_words)
# => "awesome"
'ifdxawsome45someword3'.contains_word_from_dict?(huge_list_of_words)
# => "someword"
'ifdxawsome45sameword3'.contains_word_from_dict?(huge_list_of_words)
# => nil

Test

I tried it with a bigger dictionary (~100k words) and a million lookups :

huge_list_of_words = Triez.new value_type: :object, default: nil

dict = '/usr/share/dict/american-english'
File.foreach(dict) do |word|
  word.chomp!
  huge_list_of_words[word] = word if word.size > 4 # avoid finding `if` or `word`
end

1_000_000.times do
  'ifdxawesome45someword3'.contains_word_from_dict?(huge_list_of_words)
end

It returned after 22s on my slowish laptop.

To be very honest, I don't understand how change_all works, and what its purpose is. It does seem to work fine for your purpose though! ¯\_(ツ)_/¯

like image 133
Eric Duminil Avatar answered Oct 25 '22 08:10

Eric Duminil


A benchmark between the Trie gem and Triez gem in this particular use case:

word count: 228982
                   user     system      total        real
trie          13.410000   0.050000  13.460000 ( 13.463473)
triez         11.080000   0.010000  11.090000 ( 11.102195)
trie tail     39.920000   0.140000  40.060000 ( 40.102285)
triez tail    28.960000   0.030000  28.990000 ( 29.022630)

Generally, Triez is faster for the Op's use case.

require 'triez'
require 'trie'
require 'benchmark'

DICT = '/usr/share/dict/web2'

triez = Triez.new value_type: :object, default: nil
trie = Trie.new

count = 0
File.foreach(DICT) do |word|
  word.chomp!
  if word.size > 4
    triez[word] = word
    trie.add word
    count += 1
  end
end

puts "word count: #{count}"

def in_trie?(str, trie)
  0.upto(str.length - 1) do |i|
    node = trie.root
    i.upto(str.length - 1) do |j|
      break unless node.walk! str[j]
      if node.terminal?
        return str[i..j]
      end
    end
  end
  nil
end

def in_triez?(str, triez)
  triez.change_all(:substring, str) do |v|
    return v if v
  end
  nil
end

Benchmark.bm(12) do |b|
  b.report('trie') do
    1_000_000.times { in_trie?('ifdxawesome45someword3', trie) }
  end
  b.report('triez') do
    1_000_000.times { in_triez?('ifdxawesome45someword3', triez) }
  end
  b.report('trie tail') do
    1_000_000.times { in_trie?('ifdx45someword3awesome', trie) }
  end
  b.report('triez tail') do
    1_000_000.times { in_triez?('ifdx45someword3awesome', triez) }
  end
end

UPDATE benchmark for rambling-trie, where the lines with c prefix is the compressed version. (NOTE: the ROUND has been reduced to 100K rather than 1M in the prefix benchmark)

Word count: 228982, ROUND: 100000
                      user     system      total        real
trie              1.510000   0.000000   1.510000 (  1.511772)
triez             1.170000   0.000000   1.170000 (  1.176075)
rambling          4.800000   0.010000   4.810000 (  4.847021)
c rambling       25.060000   0.050000  25.110000 ( 25.172771)
trie tail         4.540000   0.010000   4.550000 (  4.566233)
triez tail        3.080000   0.010000   3.090000 (  3.092655)
rambling tail     4.780000   0.010000   4.790000 (  4.803114)
c rambling tail  23.470000   0.020000  23.490000 ( 23.525066)

It seems rambling-trie is implemented purely in Ruby, and it doesn't offer direct methods to do prefix matching. The following monkey patches need to be added first. There may be better implementation, but I didn't dig further.

class Rambling::Trie::Container
  def match_prefix?(str)
    root.match_prefix?(str.chars)
  end
end

class Rambling::Trie::RawNode
  def match_prefix?(chars, i = 0)
    if children_tree.empty?
      true
    elsif i >= chars.size
      false
    else
      letter = chars[i].to_sym
      child = children_tree[letter]
      !!child && child.match_prefix?(chars, i + 1)
    end
  end
end

class Rambling::Trie::CompressedNode
  def match_prefix?(chars)
    if children_tree.empty?
      true
    if chars.empty?
      false
    else
      !!(recursive_get :match_prefix?, chars)
    end
  end
end

def in_r_trie?(str, r_trie)
  0.upto(str.length - 1) do |i|
    if r_trie.match_prefix? str[i..-1]
      return true
    end
  end
  false
end
like image 3
Arie Xiao Avatar answered Oct 25 '22 06:10

Arie Xiao


The fastesy way is to create a trie for all the words in the huge list. Then you need loop through all the long strings. For a particular string, do custom iteration to start searching the trie for each letter in the string to match all your words. You can terminate early if you only need to find one contained word for extra speed.

For the trie part there seems to be a sweet implementation in ruby here. It even has a example, very similar to your use case that you should be able to tweak.

word = 'forestry'
node = trie.root

word.split('').each do |char|
  break unless node.walk!(char)
  if node.terminal?
    puts "Found me a word: #{node.full_state}"
  end
end
like image 3
vidstige Avatar answered Oct 25 '22 07:10

vidstige