Given a string, what is the most efficient way to return an array of the character positions at which each line begins?
text =<<_
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.
_
Expected:
find_newlines(text) # => [0, 80, 155, 233, 313, 393]
I have posted my own answers below. I would like to accept the fastest solution as the accepted answer.
The benchmark results here will be updated when a new answer is added:
require "fruity"
compare do
padde1 {find_newlines_padde1(text)}
digitalross1 {find_newlines_digitalross1(text)}
sawa1 {find_newlines1(text)}
sawa2 {find_newlines2(text)}
end
# Running each test 512 times. Test will take about 1 second.
# digitalross1 is faster than sawa2 by 5x ± 0.1
# sawa2 is faster than sawa1 by 21.999999999999996% ± 1.0%
# sawa1 is faster than padde1 by 4.0000000000000036% ± 1.0%
def find_newlines text
s = 0
[0] + text.to_a[0..-2].map { |e| s += e.size }
end
As noted, use text.each_line.to_a for 1.9. Calling each_line also works in 1.8.7, but it is 20% slower than calling to_a alone.
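For readers on 1.9 or later, the same idea can be sketched with each_line in place of String#to_a. The method name find_newlines_each_line is made up for this illustration and is not one of the benchmarked entries; the logic is otherwise the running-sum approach shown above.

```ruby
# Sketch for Ruby 1.9+, where String#to_a no longer exists.
# find_newlines_each_line is a hypothetical name used only here.
def find_newlines_each_line(text)
  offset = 0
  # Drop the last line: the offset after it is not the start of a line.
  [0] + text.each_line.to_a[0..-2].map { |line| offset += line.size }
end

find_newlines_each_line("ab\ncd\n\nef\n") # => [0, 3, 6, 7]
```

Because it sums String#size, the offsets are character positions rather than byte positions.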
Similar to your answer:
def find_newlines_padde1 text
text.enum_for(:scan, /^/).map do
$~.begin(0)
end
end
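The version above relies on scan setting the last-match data for every zero-width match of /^/, so that begin(0) yields the character offset of each line start. A minimal sketch of the same idea without enum_for, assuming only core String#scan and Regexp.last_match (the name find_newlines_scan_block is hypothetical):

```ruby
# Sketch: collect the offset of every zero-width /^/ match.
# find_newlines_scan_block is a hypothetical name used only here.
def find_newlines_scan_block(text)
  positions = []
  # Regexp.last_match is the same object as $~ inside the block.
  text.scan(/^/) { positions << Regexp.last_match.begin(0) }
  positions
end

find_newlines_scan_block("ab\ncd\n\nef\n") # => [0, 3, 6, 7]
```

Note that /^/ does not match at the very end of the string after a trailing newline, which is why no extra position is reported there.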
You can still gain some performance with rubyinline:
require "inline"
module Kernel
inline :C do |builder|
builder.add_compile_flags '-std=c99'
builder.c %q{
static VALUE find_newlines_padde2(VALUE str) {
char newline = '\n';
VALUE res = rb_ary_new();
/* Type-check (and convert via to_str if needed) before reading the buffer. */
str = StringValue(str);
char* s = RSTRING_PTR(str);
rb_ary_push(res, LONG2FIX(0));
for (long pos=0; pos<RSTRING_LEN(str)-1; pos++) {
if (s[pos] == newline) {
rb_ary_push(res, LONG2FIX(pos+1));
}
}
return res;
}
}
end
end
Note that I'm artificially ending early with pos<RSTRING_LEN(str)-1 to get the same result you requested. You can change this to pos<RSTRING_LEN(str) if you like, so the last empty line counts as a line start, too. You will have to decide which one works for you.
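To make the effect of the two loop bounds concrete, here is a rough pure-Ruby mirror of the C loop. The helper line_starts and its count_trailing keyword are invented for this illustration (keyword arguments need Ruby 2.0+), and it works on character offsets, whereas the C version walks bytes, so the two only agree on single-byte text.

```ruby
# Hypothetical helper mirroring the C loop; not part of the benchmark.
# count_trailing: false corresponds to pos < RSTRING_LEN(str) - 1,
# count_trailing: true  corresponds to pos < RSTRING_LEN(str).
def line_starts(text, count_trailing: false)
  limit = count_trailing ? text.size : text.size - 1
  starts = [0]
  pos = 0
  # Jump from newline to newline instead of testing every character.
  while (nl = text.index("\n", pos)) && nl < limit
    starts << nl + 1
    pos = nl + 1
  end
  starts
end

line_starts("ab\ncd\n")                       # => [0, 3]
line_starts("ab\ncd\n", count_trailing: true) # => [0, 3, 6]
```

With the tighter bound, a newline that is the very last character does not open a new line; with the looser bound, the offset just past it is reported as well.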
Fruity says padde2 is faster than sawa2 by 22x ± 0.1