
How do I tokenize this string in Ruby?

I have this string:

%{Children^10 Health "sanitation management"^5}

And I want to tokenize it into an array of hashes:

[{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}]

I'm aware of StringScanner and the Syntax gem, but I can't find enough code examples for either.

Any pointers?

Radamanthus asked Apr 03 '09 11:04


3 Answers

For a real language, a lexer's the way to go - like Guss said. But if the full language is only as complicated as your example, you can use this quick hack:

irb> text = %{Children^10 Health "sanitation management"^5}
irb> text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/).map do |word,phrase,boost|
       { :keywords => (word || phrase).downcase, :boost => (boost.nil? ? nil : boost.to_i) }
     end
#=> [{:boost=>10, :keywords=>"children"}, {:boost=>nil, :keywords=>"health"}, {:boost=>5, :keywords=>"sanitation management"}]

If you're trying to parse a regular language then this method will suffice - though it wouldn't take many more complications to make the language non-regular.

A quick breakdown of the regex:

  • \w+ matches any single-term keywords
  • (?:\\.|[^\\"])* uses non-capturing parentheses ((?:...)) to match the contents of an escaped double-quoted string - either an escaped symbol (\n, \", \\, etc.) or any single character that's not an escape symbol or an end quote.
  • "((?:\\.|[^\\"])*)" captures only the contents of a quoted keyword phrase.
  • (?:(\w+)|"((?:\\.|[^\\"])*)") matches any keyword - single term or phrase, capturing single terms into $1 and phrase contents into $2
  • \d+ matches a number.
  • \^(\d+) captures a number following a caret (^). Since this is the third set of capturing parentheses, it will be captured into $3.
  • (?:\^(\d+))? captures a number following a caret if it's there, matches the empty string otherwise.
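
As a sketch of the breakdown above, the same pattern can be written with Ruby's /x (extended) flag, which ignores whitespace and allows comments inside the regex. The variable name token is just an illustrative choice, not something from the original answer.

```ruby
# The same pattern in extended mode: one piece per line, with comments.
token = /
  (?:
    (\w+)                  # group 1: a single-word keyword
    |
    "((?:\\.|[^\\"])*)"    # group 2: the contents of a quoted phrase
  )
  (?:\^(\d+))?             # group 3: an optional boost after a caret
/x

text = %{Children^10 Health "sanitation management"^5}

# The extended form should behave exactly like the one-line version.
compact = /(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/
puts text.scan(token) == text.scan(compact)   # prints true
```

The extended form is longer, but it tends to be easier to adjust when the input format grows.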

String#scan(regex) matches the regex against the string as many times as possible and returns an array of "matches". If the regex contains capturing parens, a "match" is an array of the items captured - so $1 becomes match[0], $2 becomes match[1], etc. Any capturing parenthesis that doesn't get matched against part of the string maps to a nil entry in the resulting "match".
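
To make that concrete, here is a small sketch of what String#scan returns with and without capturing groups, using the string from the question. The variable names are illustrative only.

```ruby
text = %{Children^10 Health "sanitation management"^5}

# No capturing groups: each match is simply the matched substring.
words = text.scan(/\w+/)
p words
# => ["Children", "10", "Health", "sanitation", "management", "5"]

# Three capturing groups: each match is a three-element array, with nil
# for every group that did not take part in that particular match.
matches = text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/)
p matches
# => [["Children", nil, "10"], ["Health", nil, nil], [nil, "sanitation management", "5"]]
```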

The #map then takes these matches, uses some block magic to break each captured term into different variables (we could have done do |match| ; word,phrase,boost = *match), and then creates your desired hashes. Exactly one of word or phrase will be nil, since both can't be matched against the input, so (word || phrase) will return the non-nil one, and #downcase will convert it to all lowercase. boost.to_i will convert a string to an integer while (boost.nil? ? nil : boost.to_i) will ensure that nil boosts stay nil.
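
The two ways of unpacking a match mentioned above can be compared side by side. This is only a sketch; the matches array is typed in by hand to stand for the output of scan.

```ruby
matches = [["Children", nil, "10"], ["Health", nil, nil], [nil, "sanitation management", "5"]]

# Block parameters destructure each three-element match automatically...
implicit = matches.map do |word, phrase, boost|
  { :keywords => (word || phrase).downcase, :boost => (boost.nil? ? nil : boost.to_i) }
end

# ...which is equivalent to unpacking the array by hand.
explicit = matches.map do |match|
  word, phrase, boost = *match
  { :keywords => (word || phrase).downcase, :boost => (boost.nil? ? nil : boost.to_i) }
end

puts implicit == explicit   # prints true

# Why the nil check matters: nil.to_i is 0, so calling to_i directly
# would turn "no boost" into a boost of zero.
p nil.to_i                  # prints 0
```

If the ternary feels heavy, boost && boost.to_i gives the same result for a nil or numeric-string boost.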

rampion answered Sep 30 '22 10:09


Here is a non-robust example using StringScanner. This is code I just adapted from Ruby Quiz: Parsing JSON, which has an excellent explanation.

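require 'strscan'
require 'test/unit'

# assert_equal is only available inside a test case, so the test lives in one.
class ParseTest < Test::Unit::TestCase
  def test_parse
    text = %{Children^10 Health "sanitation management"^5}
    expected = [{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}]

    assert_equal(expected, parse(text))
  end
end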

def parse(text)
  @input = StringScanner.new(text)

  output = []

  while keyword = parse_string || parse_quoted_string
    output << {
      :keywords => keyword,
      :boost => parse_boost
    }
    trim_space
  end

  output
end

def parse_string
  if @input.scan(/\w+/)
    @input.matched.downcase
  else
    nil
  end
end

def parse_quoted_string
  if @input.scan(/"/)
    str = parse_quoted_contents
    @input.scan(/"/) or raise "unclosed string"
    str
  else
    nil
  end
end

def parse_quoted_contents
  # Downcase quoted phrases too, so they match how single words are handled.
  @input.scan(/[^\\"]+/) and @input.matched.downcase
end

def parse_boost
  if @input.scan(/\^/)
    boost = @input.scan(/\d+/)
    raise 'missing boost value' if boost.nil?
    boost.to_i
  else
    nil
  end
end

def trim_space
  @input.scan(/\s+/)
end
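
The parser above relies on how StringScanner#scan behaves: it only matches at the current scan pointer, returns the matched text and advances on success, and returns nil without moving on failure. A minimal sketch of that behaviour, with illustrative variable names:

```ruby
require 'strscan'

scanner = StringScanner.new(%{Children^10 Health})

word  = scanner.scan(/\w+/)   # "Children"; the pointer moves past the word
miss  = scanner.scan(/\w+/)   # nil; the next character is "^", pointer stays put
caret = scanner.scan(/\^/)    # "^"
boost = scanner.scan(/\d+/)   # "10"
space = scanner.scan(/\s+/)   # " "
last  = scanner.scan(/\w+/)   # "Health"

p [word, miss, caret, boost, space, last]
p scanner.eos?                # true: all input has been consumed
```

This is why each parse_* method can simply try its pattern and return nil when it does not apply: a failed scan leaves the input untouched for the next alternative.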
Aaron Hinni answered Sep 30 '22 11:09


What you have here is an arbitrary grammar, and to parse it what you really want is a lexer and a parser generator - you write a grammar file that describes your syntax and then use the tool to generate a parser from your grammar.

Writing a lexer (or even a recursive parser) is not really trivial - although it is a useful exercise in programming - but you can find a list of Ruby lexers/parsers in this email message here: http://newsgroups.derkeiler.com/Archive/Comp/comp.lang.ruby/2005-11/msg02233.html

RACC is available as a standard module of Ruby 1.8, so I suggest you concentrate on that even if its manual is not really easy to follow and it requires familiarity with yacc.
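
Short of writing a full RACC grammar, the lexing step on its own can be sketched by hand with StringScanner. This is not RACC's API: the lex method and the token names :WORD, :PHRASE and :BOOST are hypothetical, chosen only for illustration. It shows the kind of token stream a generated parser would consume.

```ruby
require 'strscan'

# Hypothetical hand-written lexer: turns the input into a flat list of
# [type, value] tokens that a parser could then consume.
def lex(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next                                        # whitespace only separates tokens
    elsif scanner.scan(/\w+/)
      tokens << [:WORD, scanner.matched]
    elsif scanner.scan(/"((?:\\.|[^\\"])*)"/)
      tokens << [:PHRASE, scanner[1]]             # group 1 is the text inside the quotes
    elsif scanner.scan(/\^(\d+)/)
      tokens << [:BOOST, scanner[1].to_i]
    else
      raise ArgumentError, "unexpected character at offset #{scanner.pos}"
    end
  end
  tokens
end

p lex(%{Children^10 Health "sanitation management"^5})
# => [[:WORD, "Children"], [:BOOST, 10], [:WORD, "Health"], [:PHRASE, "sanitation management"], [:BOOST, 5]]
```

Deciding which keyword a boost belongs to is left to the parsing stage, which is the part a grammar would describe.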

Guss answered Sep 30 '22 10:09