
Newline positions

Tags: string, ruby

Given a string, what is the most efficient way to return an array of the character positions at which each line begins?

text =<<_
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in 
culpa qui officia deserunt mollit anim id est laborum.
_

Expected:

find_newlines(text) # => [0, 80, 155, 233, 313, 393]

I have posted my own answers as well. I will accept whichever answer turns out to be fastest.


The benchmark results below will be updated whenever a new answer is added:

require "fruity"

compare do
  padde1 {find_newlines_padde1(text)}
  digitalross1 {find_newlines_digitalross1(text)}
  sawa1 {find_newlines1(text)}
  sawa2 {find_newlines2(text)}
end

# Running each test 512 times. Test will take about 1 second.
# digitalross1 is faster than sawa2 by 5x ± 0.1
# sawa2 is faster than sawa1 by 21.999999999999996% ± 1.0%
# sawa1 is faster than padde1 by 4.0000000000000036% ± 1.0%
sawa asked Feb 18 '13 16:02


2 Answers

def find_newlines text
  s = 0
  [0] + text.to_a[0..-2].map { |e| s += e.size }
end

As noted, String#to_a no longer exists in 1.9, so use text.each_line.to_a there. Calling each_line also works in 1.8.7, but it is about 20% slower than calling to_a alone.
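For readers on 1.9 or later, here is a minimal sketch of the same running-sum idea built on each_line. The method name find_newlines_each_line is made up for illustration and is not one of the benchmarked methods; I have not timed it.

```ruby
# Hypothetical 1.9+ adaptation of the running-sum approach above.
# Each line start is the sum of the sizes (in characters) of all
# preceding lines, newline included.
def find_newlines_each_line(text)
  offset = 0
  starts = [0]
  lines = text.each_line.to_a
  # Skip the last line: nothing starts after it.
  lines[0..-2].each do |line|
    offset += line.size
    starts << offset
  end
  starts
end
```

Because it sums String#size, the positions are character offsets, not byte offsets.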

DigitalRoss answered Sep 20 '22 13:09


Similar to your answer:

def find_newlines_padde1 text
  text.enum_for(:scan, /^/).map do
    $~.begin(0)
  end
end
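If the enum_for call looks opaque, the same idea can be sketched with the block form of scan, which also sets the last match for every zero-width hit of /^/. The name find_newlines_scan is hypothetical and this variant was not part of the benchmark.

```ruby
# Hypothetical block-form equivalent of the enum_for(:scan, /^/) approach.
# /^/ matches (with zero width) at the start of each line; the match
# object's begin(0) is the character offset of that match.
def find_newlines_scan(text)
  positions = []
  text.scan(/^/) { positions << Regexp.last_match.begin(0) }
  positions
end
```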

You can gain some more performance with RubyInline:

require "inline"
module Kernel
  inline :C do |builder|
    builder.add_compile_flags '-std=c99'
    builder.c %q{
      static VALUE find_newlines_padde2(VALUE str) {
        char newline = '\n';
        VALUE res = rb_ary_new();
        str = StringValue(str);  /* coerce to String before touching its buffer */
        char* s = RSTRING_PTR(str);
        rb_ary_push(res, LONG2FIX(0));
        for (long pos=0; pos<RSTRING_LEN(str)-1; pos++) {
          if (s[pos] == newline) {
             rb_ary_push(res, LONG2FIX(pos+1));
          }
        }
        return res;
      }
    }
  end
end

Note that I'm artificially ending early with pos<RSTRING_LEN(str)-1 to get the same result you requested. You can change this to pos<RSTRING_LEN(str) if you like, so that the empty line after the final newline counts as a line start, too. You will have to decide which one works for you.
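To make the two boundary choices concrete, here is a rough pure-Ruby sketch of the same loop. The method name line_starts and the count_trailing keyword are invented for this illustration (keyword arguments need Ruby 2.0+), and it is meant to show the behavior, not to compete on speed.

```ruby
# Hypothetical pure-Ruby illustration of the boundary choice in the C loop.
# count_trailing: false mirrors pos < RSTRING_LEN(str) - 1
# count_trailing: true  mirrors pos < RSTRING_LEN(str)
def line_starts(text, count_trailing: false)
  starts = [0]
  pos = 0
  while (nl = text.index("\n", pos))
    pos = nl + 1
    # A newline at the very end only produces a start if requested.
    starts << pos if pos < text.size || count_trailing
  end
  starts
end

line_starts("ab\ncd\n")                        # => [0, 3]
line_starts("ab\ncd\n", count_trailing: true)  # => [0, 3, 6]
```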

Fruity says padde2 is faster than sawa2 by 22x ± 0.1

Patrick Oscity answered Sep 19 '22 13:09