I have this string:
%{Children^10 Health "sanitation management"^5}
And I want to tokenize it into an array of hashes:
[{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}]
I'm aware of StringScanner and the Syntax gem, but I can't find enough code examples for either.
Any pointers?
For a real language, a lexer's the way to go - like Guss said. But if the full language is only as complicated as your example, you can use this quick hack:
irb> text = %{Children^10 Health "sanitation management"^5}
irb> text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/).map do |word,phrase,boost|
{ :keywords => (word || phrase).downcase, :boost => (boost.nil? ? nil : boost.to_i) }
end
#=> [{:boost=>10, :keywords=>"children"}, {:boost=>nil, :keywords=>"health"}, {:boost=>5, :keywords=>"sanitation management"}]
If you're trying to parse a regular language then this method will suffice - though it wouldn't take many more complications to make the language non-regular.
A quick breakdown of the regex:

\w+ matches any single-term keyword.

(?:\\.|[^\\"])* uses non-capturing parentheses (?:...) to match the contents of an escaped double-quoted string: either an escaped symbol (\n, \", \\, etc.) or any single character that's not an escape symbol or an end quote.

"((?:\\.|[^\\"])*)" captures only the contents of a quoted keyword phrase.

(?:(\w+)|"((?:\\.|[^\\"])*)") matches any keyword, single term or phrase, capturing single terms into $1 and phrase contents into $2.

\d+ matches a number.

\^(\d+) captures a number following a caret (^). Since this is the third set of capturing parentheses, it will be captured into $3.

(?:\^(\d+))? captures a number following a caret if one is there, and matches the empty string otherwise.

String#scan(regex) matches the regex against the string as many times as possible, outputting an array of "matches". If the regex contains capturing parens, a "match" is an array of the items captured, so $1 becomes match[0], $2 becomes match[1], and so on. Any capturing parenthesis that doesn't get matched against part of the string maps to a nil entry in the resulting "match".

The #map then takes these matches, uses some block magic to break each captured term into different variables (we could instead have written do |match| ; word, phrase, boost = *match), and then creates your desired hashes. Exactly one of word or phrase will be nil, since both can't be matched against the input, so (word || phrase) will return the non-nil one, and #downcase will convert it to all lowercase. boost.to_i will convert a string to an integer, while (boost.nil? ? nil : boost.to_i) ensures that nil boosts stay nil.
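As a quick illustration of the String#scan behavior described above (the nil slots for unmatched capture groups are exactly what the #map block relies on):

```ruby
# Each "match" from String#scan is an array with one slot per capture
# group; groups that didn't participate in a match come back as nil.
text = %{Children^10 Health "sanitation management"^5}
matches = text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/)

matches[0]  # => ["Children", nil, "10"]  (word matched, no phrase)
matches[1]  # => ["Health", nil, nil]     (no boost, so $3 is nil)
matches[2]  # => [nil, "sanitation management", "5"]
```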
Here is a non-robust example using StringScanner. This is code I just adapted from Ruby Quiz: Parsing JSON, which has an excellent explanation.
require 'strscan'

def test_parse
  text = %{Children^10 Health "sanitation management"^5}
  expected = [{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}]
  assert_equal(expected, parse(text))
end

def parse(text)
  @input = StringScanner.new(text)
  output = []
  while keyword = parse_string || parse_quoted_string
    output << {
      :keywords => keyword,
      :boost => parse_boost
    }
    trim_space
  end
  output
end

def parse_string
  if @input.scan(/\w+/)
    @input.matched.downcase
  else
    nil
  end
end

def parse_quoted_string
  if @input.scan(/"/)
    str = parse_quoted_contents
    @input.scan(/"/) or raise "unclosed string"
    str
  else
    nil
  end
end

def parse_quoted_contents
  @input.scan(/[^\\"]+/) and @input.matched
end

def parse_boost
  if @input.scan(/\^/)
    boost = @input.scan(/\d+/)
    raise 'missing boost value' if boost.nil?
    boost.to_i
  else
    nil
  end
end

def trim_space
  @input.scan(/\s+/)
end
What you have here is an arbitrary grammar, and to parse it what you really want is a lexer plus a parser generator - you can write a grammar file that describes your syntax and then have the generator produce a recursive parser from your grammar.
Writing a lexer (or even a recursive parser) is not really trivial - although it is a useful exercise in programming - but you can find a list of Ruby lexers/parsers in this email message here: http://newsgroups.derkeiler.com/Archive/Comp/comp.lang.ruby/2005-11/msg02233.html
RACC is available as a standard module of Ruby 1.8, so I suggest you concentrate on that even if its manual is not really easy to follow and it requires familiarity with yacc.