I have a problem where I am trying to search for a substring in string. That substring may or may not be in the string.
str = "hello how are you?"
substr = "how are"
Two ways which I know if can be done are:
string.indexOf("how are")
But, is there any other "optimized" way? What would you have done?
Can Ruby provide a better answer? Since we use jRuby the answer can be in Ruby or Java.
In Ruby, use the String#include?
method:
str = "hello how are you?"
substr = "how are"
str.include? substr
which returns true
.
For an overview of "other ways", you can start with the "String-searching algorithm" article on Wikipedia.
Indexing strings using "Substring index" is one very obvious way of speeding things up, as mentioned by Martin, which only is appropriate if you are doing several searches over the same string:
"What would you have done?"
I'd do a benchmark and try comparing different ways of accomplishing the same thing to learn which is fastest.
In older versions of Ruby we'd see the regex-based searches running more slowly. The new engine in 1.9.2, which I'm using for the benchmark, makes a big difference. In particular, unanchored searches used to be a lot slower than anchored. Now it's a wash whether you use regex or a fixed string searches for the most part. Using match() without precompiling the regex is a painful hit for speed so if you're doing a lot of loops with the same pattern then it makes sense to assign the pattern to a variable and reference the variable.
The times displayed are how long each test took to perform "n" (750,000) iterations, so lower numbers are better.
require 'benchmark'
LOREM = %q{Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut et convallis purus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Cras interdum nibh et nunc pellentesque vestibulum. Donec elementum felis malesuada urna vehicula consectetur commodo urna accumsan. Phasellus bibendum euismod tincidunt. Sed pellentesque cursus faucibus. Etiam bibendum tincidunt nibh eget ultrices. Fusce imperdiet, felis id consequat imperdiet, justo est ultrices elit, sed vestibulum dui nibh vel felis. Sed feugiat, libero quis consequat semper, magna tellus facilisis enim, rutrum adipiscing eros mauris commodo metus. Sed lobortis aliquet augue ac sodales. Quisque pharetra odio vel augue tempus porttitor.}
REGEX1 = %r{/porttitor\.$/}
REGEX2 = %r{/porttitor\./}
REGEX3 = %r{/porttitor\.\Z/}
n = 750_000
puts "word in string"
Benchmark.bm(15) do |x|
x.report('string[""]:') { n.times { LOREM['porttitor.'] } }
x.report('string[//]:') { n.times { LOREM[/porttitor\./] } } # unanchored regex
x.report('string[/$/]:') { n.times { LOREM[/porttitor\.$/] } } # anchored regex
x.report('string[/\Z/]:') { n.times { LOREM[/porttitor\.\Z/] } } # anchored regex
x.report('index():') { n.times { LOREM.index('porttitor.') } }
x.report('include?():') { n.times { LOREM.include?('porttitor.') } }
x.report('match($):') { n.times { LOREM.match(/porttitor\.$/) } }
x.report('match(\Z):') { n.times { LOREM.match(/porttitor\.\Z/) } }
x.report('match():') { n.times { LOREM.match(/porttitor\./) } }
x.report('match2($):') { n.times { LOREM.match(REGEX1) } } # compiled regex w/ anchor
x.report('match2():') { n.times { LOREM.match(REGEX2) } } # compiled report w/out anchor
x.report('match2(\Z):') { n.times { LOREM.match(REGEX3) } } # compiled regex w/ anchor
end
puts
puts "word not in string"
Benchmark.bm(15) do |x|
x.report('string[""]:') { n.times { LOREM['porttit0r.'] } }
x.report('string[//]:') { n.times { LOREM[/porttit0r\./] } } # unanchored regex
x.report('string[/$/]:') { n.times { LOREM[/porttit0r\.$/] } } # anchored regex
x.report('string[/\Z/]:') { n.times { LOREM[/porttit0r\.\Z/] } } # anchored regex
x.report('index():') { n.times { LOREM.index('porttit0r.') } }
x.report('include?():') { n.times { LOREM.include?('porttit0r.') } }
x.report('match($):') { n.times { LOREM.match(/porttit0r\.$/) } }
x.report('match(\Z):') { n.times { LOREM.match(/porttit0r\.\Z/) } }
x.report('match():') { n.times { LOREM.match(/porttit0r\./) } }
end
With the output:
word in string
user system total real
string[""]: 0.670000 0.000000 0.670000 ( 0.675319)
string[//]: 0.700000 0.000000 0.700000 ( 0.706148)
string[/$/]: 0.720000 0.000000 0.720000 ( 0.716853)
string[/\Z/]: 0.530000 0.000000 0.530000 ( 0.527568)
index(): 0.630000 0.000000 0.630000 ( 0.638562)
include?(): 0.610000 0.000000 0.610000 ( 0.603223)
match($): 1.690000 0.000000 1.690000 ( 1.696045)
match(\Z): 1.520000 0.010000 1.530000 ( 1.532107)
match(): 1.700000 0.000000 1.700000 ( 1.698748)
match2($): 0.840000 0.000000 0.840000 ( 0.847590)
match2(): 0.840000 0.000000 0.840000 ( 0.840969)
match2(\Z): 0.840000 0.000000 0.840000 ( 0.835557)
word not in string
user system total real
string[""]: 0.570000 0.000000 0.570000 ( 0.578120)
string[//]: 0.740000 0.000000 0.740000 ( 0.734751)
string[/$/]: 0.730000 0.000000 0.730000 ( 0.735599)
string[/\Z/]: 0.560000 0.000000 0.560000 ( 0.563673)
index(): 0.620000 0.000000 0.620000 ( 0.619451)
include?(): 0.570000 0.000000 0.570000 ( 0.574413)
match($): 0.910000 0.010000 0.920000 ( 0.910059)
match(\Z): 0.730000 0.000000 0.730000 ( 0.726533)
match(): 0.950000 0.000000 0.950000 ( 0.960865)
For reference, here's some numbers using Ruby 1.8.7, which is the default for Snow Leopard:
word in string
user system total real
string[""]: 1.130000 0.000000 1.130000 ( 1.130687)
string[//]: 1.170000 0.000000 1.170000 ( 1.165692)
string[/$/]: 1.180000 0.000000 1.180000 ( 1.184954)
string[/\Z/]: 1.180000 0.000000 1.180000 ( 1.179168)
index(): 1.070000 0.000000 1.070000 ( 1.077791)
include?(): 1.060000 0.000000 1.060000 ( 1.056430)
match($): 1.470000 0.010000 1.480000 ( 1.472797)
match(\Z): 1.480000 0.000000 1.480000 ( 1.490172)
match(): 1.480000 0.000000 1.480000 ( 1.478146)
match2($): 0.650000 0.000000 0.650000 ( 0.653029)
match2(): 0.570000 0.000000 0.570000 ( 0.574384)
match2(\Z): 0.640000 0.000000 0.640000 ( 0.646688)
word not in string
user system total real
string[""]: 1.040000 0.000000 1.040000 ( 1.038885)
string[//]: 0.510000 0.000000 0.510000 ( 0.507031)
string[/$/]: 0.510000 0.000000 0.510000 ( 0.508425)
string[/\Z/]: 0.500000 0.000000 0.500000 ( 0.507316)
index(): 1.060000 0.000000 1.060000 ( 1.055157)
include?(): 1.030000 0.000000 1.030000 ( 1.037060)
match($): 0.630000 0.000000 0.630000 ( 0.623627)
match(\Z): 0.620000 0.000000 0.620000 ( 0.624737)
match(): 0.620000 0.000000 0.620000 ( 0.623049)
I added additional tests to give some ideas of the effects of using only unanchored and anchored regex:
require 'fruity'
LOREM = %{Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut et convallis purus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Cras interdum nibh et nunc pellentesque vestibulum. Donec elementum felis malesuada urna vehicula consectetur commodo urna accumsan. Phasellus bibendum euismod tincidunt. Sed pellentesque cursus faucibus. Etiam bibendum tincidunt nibh eget ultrices. Fusce imperdiet, felis id consequat imperdiet, justo est ultrices elit, sed vestibulum dui nibh vel felis. Sed feugiat, libero quis consequat semper, magna tellus facilisis enim, rutrum adipiscing eros mauris commodo metus. Sed lobortis aliquet augue ac sodales. Quisque pharetra odio vel augue tempus porttitor.}
compare do
str_slice_regex { LOREM[/porttitor\./] } # unanchored regex
str_slice_dollar { LOREM[/porttitor\.$/] } # anchored regex
str_slice_ctrlZ { LOREM[/porttitor\.\Z/] } # anchored regex
str_slice_ctrlz { LOREM[/porttitor\.\z/] } # anchored regex
end
# >> Running each test 8192 times. Test will take about 1 second.
# >> str_slice_ctrlz is similar to str_slice_ctrlZ
# >> str_slice_ctrlZ is faster than str_slice_regex by 2x ± 0.1
# >> str_slice_regex is similar to str_slice_dollar
This is using Fruity, so the results aren't directly correlated with the information above, but it's still useful.
Here's some updated information:
# >> Running on Ruby v.2.7.0
# >> word in string
# >> user system total real
# >> string[""]: 0.368283 0.000147 0.368430 ( 0.368468)
# >> string[//]: 0.329253 0.000080 0.329333 ( 0.329466)
# >> string[/$/]: 0.330270 0.000172 0.330442 ( 0.330594)
# >> string[/\Z/]: 0.183119 0.000048 0.183167 ( 0.183209)
# >> index(): 0.358397 0.000289 0.358686 ( 0.360185)
# >> include?(): 0.352700 0.000196 0.352896 ( 0.353056)
# >> match($): 0.761605 0.001502 0.763107 ( 0.763297)
# >> match(\Z): 0.631132 0.000507 0.631639 ( 0.631767)
# >> match(): 0.765219 0.000634 0.765853 ( 0.766199)
# >> match2($): 0.394938 0.000128 0.395066 ( 0.395173)
# >> match2(): 0.391687 0.000080 0.391767 ( 0.391879)
# >> match2(\Z): 0.389440 0.000089 0.389529 ( 0.389678)
# >>
# >> word not in string
# >> user system total real
# >> string[""]: 0.365097 0.000117 0.365214 ( 0.365262)
# >> string[//]: 0.388117 0.000361 0.388478 ( 0.389008)
# >> string[/$/]: 0.381933 0.000091 0.382024 ( 0.382061)
# >> string[/\Z/]: 0.236101 0.000086 0.236187 ( 0.236307)
# >> index(): 0.369898 0.000131 0.370029 ( 0.370267)
# >> include?(): 0.361057 0.000065 0.361122 ( 0.361202)
# >> match($): 0.409296 0.000390 0.409686 ( 0.410103)
# >> match(\Z): 0.258903 0.000074 0.258977 ( 0.259095)
# >> match(): 0.399220 0.000070 0.399290 ( 0.399386)
# >> --------------------
# >> Running each test 16384 times. Test will take about 1 second.
# >> str_slice_ctrlz is similar to str_slice_ctrlZ
# >> str_slice_ctrlZ is faster than str_slice_dollar by 2x ± 0.1
# >> str_slice_dollar is similar to str_slice_regex
"Finding if a sentence contains a specific phrase in Ruby" is related.
If you want only check if a substring is in string, you can use: str[substr]
.
It returns substring or nil.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With