Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ruby on Rails regexp equals-tilde vs. array include for checking a list of options

I'm using Rails 3.2.3 with Ruby 1.9.3p0.

I'm finding that I often have to determine whether a string occurs in a list of options. It seems I can use the Ruby array .include method:

<% if ['todo','pending','history'].include?(params[:category]) %>

or the regular expression equals-tilde match shorthand with vertical bars separating options:

<% if params[:category] =~ /todo|pending|history/ %>

Is one better than the other in terms of performance?

Is there an even better approach?

like image 950
Mark Berry Avatar asked Aug 10 '12 16:08

Mark Berry


2 Answers

Summary: Array#include? with String elements wins for both accepted and rejected inputs, for your example with only three acceptable values. For a larger set to check, it looks like Set#include? with String elements might win.


How to Test

We should test this is empirically.

Here are a couple of alternatives that you may want to consider as well: pre-compiled regex, a list of symbols, and a Set with String elements.

I would imagine that the performance may also depend on whether most of your inputs fall into the expected set, and are accepted, or whether most are outside the set, and are rejected.

Here's an empirical test script:

require 'benchmark'
require 'set'

strings = ['todo','pending','history']
string_set = Set.new(strings)
symbols = strings.map(&:to_sym)
regex_compiled = Regexp.new(strings.join("|"))

strings_avg_size = (strings.map(&:size).inject {|sum, n| sum + n}.to_f / strings.size).to_i
num_inputs = 1_000_000

accepted_inputs = (0...num_inputs).map { strings[rand(strings.size)] } 
rejected_inputs = (0...num_inputs).map { (0..strings_avg_size).map { ('a'...'z').to_a[rand(26)] }.join }

Benchmark.bmbm(40) do |x|
  x.report("Array#include?, Strings, accepted:") { accepted_inputs.map {|s| strings.include?(s) } }
  x.report("Array#include?, Strings, rejected:") { rejected_inputs.map {|s| strings.include?(s) } }
  x.report("Array#include?, Symbols, accepted:") { accepted_inputs.map {|s| symbols.include?(s.to_sym) } }
  x.report("Array#include?, Symbols, rejected:") { rejected_inputs.map {|s| symbols.include?(s.to_sym) } }
  x.report("Set#include?, Strings, accepted:") { accepted_inputs.map {|s| string_set.include?(s) } }
  x.report("Set#include?, Strings, rejected:") { rejected_inputs.map {|s| string_set.include?(s) } }
  x.report("Regexp#match, interpreted, accepted:") { accepted_inputs.map {|s| s =~ /todo|pending|history/ } }
  x.report("Regexp#match, interpreted, rejected:") { rejected_inputs.map {|s| s =~ /todo|pending|history/ } }
  x.report("Regexp#match, compiled, accepted:") { accepted_inputs.map {|s| regex_compiled.match(s) } }
  x.report("Regexp#match, compiled, rejected:") { rejected_inputs.map {|s| regex_compiled.match(s) } }
end

Results

Rehearsal ---------------------------------------------------------------------------
Array#include?, Strings, accepted:        0.210000   0.000000   0.210000 (  0.215099)
Array#include?, Strings, rejected:        0.530000   0.010000   0.540000 (  0.543898)
Array#include?, Symbols, accepted:        0.330000   0.000000   0.330000 (  0.337767)
Array#include?, Symbols, rejected:        1.870000   0.050000   1.920000 (  1.923155)
Set#include?, Strings, accepted:          0.270000   0.000000   0.270000 (  0.274774)
Set#include?, Strings, rejected:          0.460000   0.000000   0.460000 (  0.463925)
Regexp#match, interpreted, accepted:      0.380000   0.000000   0.380000 (  0.382060)
Regexp#match, interpreted, rejected:      0.650000   0.000000   0.650000 (  0.660775)
Regexp#match, compiled, accepted:         1.130000   0.080000   1.210000 (  1.220970)
Regexp#match, compiled, rejected:         0.630000   0.000000   0.630000 (  0.640721)
------------------------------------------------------------------ total: 6.600000sec

                                              user     system      total        real
Array#include?, Strings, accepted:        0.210000   0.000000   0.210000 (  0.219060)
Array#include?, Strings, rejected:        0.430000   0.000000   0.430000 (  0.444911)
Array#include?, Symbols, accepted:        0.340000   0.000000   0.340000 (  0.341970)
Array#include?, Symbols, rejected:        1.080000   0.000000   1.080000 (  1.089961)
Set#include?, Strings, accepted:          0.270000   0.000000   0.270000 (  0.281270)
Set#include?, Strings, rejected:          0.400000   0.000000   0.400000 (  0.406181)
Regexp#match, interpreted, accepted:      0.370000   0.000000   0.370000 (  0.366931)
Regexp#match, interpreted, rejected:      0.560000   0.000000   0.560000 (  0.558652)
Regexp#match, compiled, accepted:         0.920000   0.000000   0.920000 (  0.915914)
Regexp#match, compiled, rejected:         0.620000   0.000000   0.620000 (  0.627620)

Conclusions

(see the Summary above)

It makes sense to me, upon reflection, that the Array of Symbols would be very slow for rejected inputs, because every single one of those random strings must be interned in the Symbol table before the check can be made.

It makes less sense to me, even after pondering, that the compiled Regexp would perform so badly, especially compared to the Regexp interpreted as a literal in the code. Can anyone explain why it does so badly?

like image 52
ms-tg Avatar answered Oct 13 '22 20:10

ms-tg


@ms-tg's answer has good benchmarks, and answers your question well as far as I'm concerned. I just wanted to add a small note: be careful with this because these two options won't always have the same results:

params = Hash.new
keyword_array = ['todo','pending','history']  
included = nil

params[:category] = "history plus other text" 

start_time = Time.now
1000.times do
   included = keyword_array.include?(params[:category])
end
puts "Array.include? returned #{included} in #{(Time.now - start_time)*1000}ms"    
start_time = Time.now
1000.times do
   included = (params[:category] =~ /todo|pending|history/).is_a?(Integer) 
end
puts "Regexp returned #{included} in #{(Time.now - start_time)*1000}ms"  

Returns:

Array.include? returned false in 0.477ms

Regexp returned true in 0.953ms

Notice that the regular expression returned true in this case, but the array.include? returned false. This should be considered when building your logic.

Basically, if the string isn't in the array exactly, array.include? will be false, but if one of the keywords is anywhere in the string the regular expression will be true (regardless of whether there's other text).

like image 24
DRobinson Avatar answered Oct 13 '22 21:10

DRobinson