I have this string:
%{Children^10 Health "sanitation management"^5}
And I want to tokenize it into an array of hashes:
[{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}]
I'm aware of StringScanner and the Syntax gem, but I can't find enough code examples for either.
Any pointers?
For a real language, a lexer's the way to go - like Guss said. But if the full language is only as complicated as your example, you can use this quick hack:
irb> text = %{Children^10 Health "sanitation management"^5}
irb> text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/).map do |word,phrase,boost|
{ :keywords => (word || phrase).downcase, :boost => (boost.nil? ? nil : boost.to_i) }
end
#=> [{:boost=>10, :keywords=>"children"}, {:boost=>nil, :keywords=>"health"}, {:boost=>5, :keywords=>"sanitation management"}]
If you're trying to parse a regular language then this method will suffice - though it wouldn't take many more complications to make the language non-regular.
A quick breakdown of the regex:

\w+ matches any single-term keyword.

(?:\\.|[^\\"])* uses non-capturing parentheses (?:...) to match the contents of an escaped double-quoted string: either an escaped symbol (\n, \", \\, etc.) or any single character that's not an escape symbol or an end quote.

"((?:\\.|[^\\"])*)" captures only the contents of a quoted keyword phrase.

(?:(\w+)|"((?:\\.|[^\\"])*)") matches any keyword, single term or phrase, capturing single terms into $1 and phrase contents into $2.

\d+ matches a number.

\^(\d+) captures a number following a caret (^). Since this is the third set of capturing parentheses, it will be captured into $3.

(?:\^(\d+))? captures a number following a caret if one is there, and matches the empty string otherwise.

String#scan(regex) matches the regex against the string as many times as possible, outputting an array of "matches". If the regex contains capturing parens, a "match" is an array of the items captured, so $1 becomes match[0], $2 becomes match[1], and so on. Any capturing parenthesis that doesn't get matched against part of the string maps to a nil entry in the resulting "match".

The #map then takes these matches, uses some block magic to break each captured term into different variables (we could instead have written do |match| ; word, phrase, boost = *match), and then creates your desired hashes. Exactly one of word or phrase will be nil, since both can't be matched against the input, so (word || phrase) will return the non-nil one, and #downcase will convert it to all lowercase. boost.to_i will convert a string to an integer, while (boost.nil? ? nil : boost.to_i) ensures that nil boosts stay nil.
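As a quick illustration of the String#scan behavior described above (the nil slots for unmatched capture groups are exactly what the #map block relies on):

```ruby
# Each "match" from String#scan is an array with one slot per capture
# group; groups that didn't participate in a match come back as nil.
text = %{Children^10 Health "sanitation management"^5}
matches = text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/)

matches[0]  # => ["Children", nil, "10"]  (word matched, no phrase)
matches[1]  # => ["Health", nil, nil]     (no boost, so $3 is nil)
matches[2]  # => [nil, "sanitation management", "5"]
```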
Here is a non-robust example using StringScanner. This is code I just adapted from Ruby Quiz: Parsing JSON, which has an excellent explanation.
require 'strscan'

def test_parse
  text = %{Children^10 Health "sanitation management"^5}
  expected = [{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}]
  assert_equal(expected, parse(text))
end

def parse(text)
  @input = StringScanner.new(text)
  output = []
  while keyword = parse_string || parse_quoted_string
    output << {
      :keywords => keyword,
      :boost => parse_boost
    }
    trim_space
  end
  output
end

def parse_string
  if @input.scan(/\w+/)
    @input.matched.downcase
  else
    nil
  end
end

def parse_quoted_string
  if @input.scan(/"/)
    str = parse_quoted_contents
    @input.scan(/"/) or raise "unclosed string"
    str
  else
    nil
  end
end

def parse_quoted_contents
  @input.scan(/[^\\"]+/) and @input.matched
end

def parse_boost
  if @input.scan(/\^/)
    boost = @input.scan(/\d+/)
    raise 'missing boost value' if boost.nil?
    boost.to_i
  else
    nil
  end
end

def trim_space
  @input.scan(/\s+/)
end
What you have here is an arbitrary grammar, and to parse it what you really want is a lexer plus a parser generator - you can write a grammar file that describes your syntax and then have the generator produce a recursive parser from your grammar.
Writing a lexer (or even a recursive parser) is not really trivial - although it is a useful exercise in programming - but you can find a list of Ruby lexers/parsers in this email message here: http://newsgroups.derkeiler.com/Archive/Comp/comp.lang.ruby/2005-11/msg02233.html
RACC is available as a standard module of Ruby 1.8, so I suggest you concentrate on that even if its manual is not really easy to follow and it requires familiarity with yacc.