
Extract values from a text body in Ruby

I need to extract some values from a multi-line string (which I read from the text body of emails). I want to be able to feed patterns to my parser so I can customize it for different emails later. I came up with the following:

#!/usr/bin/env ruby

text1 = <<-eos
Lorem ipsum dolor sit amet, 

Name: Pepe Manuel Periquita

Email: [email protected]

Sisters: 1
Brothers: 3
Children: 2

Lorem ipsum dolor sit amet
eos

pattern1 = {
  :exp => /Name:[\s]*(.*?)$\s*
          Email:[\s]*(.*?)$\s*
          Sisters:[\s]*(.*?)$\s*
          Brothers:[\s]*(.*?)$\s*
          Children:[\s]*(.*?)$/mx,
  :blk => lambda do |m|
    m.flatten!
    {:name => m[0],
     :email => m[1],
     :total => m.drop(2).inject(0){|sum,item| sum + item.to_i}}
  end
}

# Scan on text returns 
#[["Pepe Manuel Periquita", "[email protected]", "1", "3", "2"]]

def do_parse(text, pattern)
  data = pattern[:blk].call(text.scan(pattern[:exp]))

  puts data.inspect
end


do_parse text1, pattern1

# ./text_parser.rb
# {:email=>"[email protected]", :total=>6, :name=>"Pepe Manuel Periquita"}

So, I define the pattern as a regular expression paired with a block that builds a hash from the matches. The "parser" simply takes the text and applies the rules by executing the block on the result of matching the regular expression against the text with scan.
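
To illustrate the behavior this relies on: when the regular expression contains capture groups, String#scan returns one sub-array of captures per match. A minimal sketch, assuming a made-up two-line sample (the names sample and rows are only for illustration, not part of the parser above):

```ruby
# Hypothetical sample text, not the real email body.
sample = "Sisters: 1\nBrothers: 3\n"

# With capture groups, scan returns one array of captures per match.
rows = sample.scan(/^(\w+):\s*(\d+)$/)

p rows  # => [["Sisters", "1"], ["Brothers", "3"]]
```

This is why the block in pattern1 calls flatten! first: a single match of the five-group expression arrives wrapped in an outer array.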

At the moment I have to parse emails with the format shown in text1, but later I would like to add patterns as easily as possible to extract data from different emails (the format will be fixed for each type of email). Therefore I would like to simplify the pattern by moving as much as possible into the "parser". The code above works and extracts the data, but most of the work is still done in the pattern...

Is this the right way to go?

Could it be simplified, or can you think of a different or better solution for this problem?

Update

I updated the parser following Tonttu's solution, so the pattern hash is now:

pattern2 = {
  :exp => /^(.+?):\s*(.+)$/,
  :blk => lambda do |m|
    r = Hash[m.map{|x| [x[0].downcase.to_sym, x[1]]}]

    {:name => r[:name],
     :email => r[:email],
     :total => r[:children].to_i + r[:brothers].to_i + r[:sisters].to_i}
  end
}

Asked by Miquel, Jan 23 '26 05:01


1 Answer

Maybe something like this is generic enough?

require 'pp'

pp Hash[*text1.scan(/^(.+?):\s(.+)$/).map{|x|
     [x[0].downcase.to_sym, x[1]]
   }.flatten]

=>
{:sisters=>"1",
 :brothers=>"3",
 :children=>"2",
 :name=>"Pepe Manuel Periquita",
 :email=>"[email protected]"}
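
If the goal is to keep each pattern as small as possible, one option is to move the key/value extraction into the parser and let a pattern only declare which fields to copy and which to sum. The following is a rough sketch under that assumption, not a definitive design: extract_fields, parse, the :copy and :sum keys, and family_pattern are all hypothetical names, the regular expression is the one from the question's update, and the sample text leaves out the Email line.

```ruby
# Hypothetical helper: collect every "Key: value" line into a hash
# keyed by the downcased, symbolized label.
def extract_fields(text)
  text.scan(/^(.+?):\s*(.+)$/).each_with_object({}) do |(key, value), fields|
    fields[key.strip.downcase.to_sym] = value.strip
  end
end

# Hypothetical parser: the pattern is plain data, so adding a new
# email type means adding a new hash rather than a new block.
def parse(text, pattern)
  fields = extract_fields(text)
  result = {}
  pattern[:copy].each { |name| result[name] = fields[name] }
  result[:total] = pattern[:sum].inject(0) { |sum, name| sum + fields[name].to_i }
  result
end

# Shortened sample in the same format as text1 (assumption: no Email line).
sample = <<-EOS
Lorem ipsum dolor sit amet,

Name: Pepe Manuel Periquita

Sisters: 1
Brothers: 3
Children: 2
EOS

family_pattern = { :copy => [:name], :sum => [:sisters, :brothers, :children] }
p parse(sample, family_pattern)
```

For this sample the result holds the name and a total of 6. A missing field is copied as nil and counts as 0 in the sum, because nil.to_i is 0, which may or may not be the behavior you want.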

Answered by Tonttu, Jan 25 '26 23:01


