I'm looking to speed up my discovery process here quite a bit, as this is my first venture into the world of lexical analysis. Maybe this is even the wrong path. First, I'll describe my problem:
I've got very large properties files (on the order of 1,000 properties) which, when distilled, come down to about 15 important properties; the rest can be generated or rarely ever change.
So, for example:
general {
    name = myname
    ip = 127.0.0.1
}
component1 {
    key = value
    foo = bar
}
This is the type of format I want to create, in order to tokenize something like:
property.${general.name}blah.home.directory = /blah
property.${general.name}.ip = ${general.ip}
property.${component1}.ip = ${general.ip}
property.${component1}.foo = ${component1.foo}
into
property.mynameblah.home.directory = /blah
property.myname.ip = 127.0.0.1
property.component1.ip = 127.0.0.1
property.component1.foo = bar
Lexical analysis and tokenization sound like my best route, but this is a very simple form of it. It's a simple grammar, a simple substitution, and I'd like to make sure that I'm not bringing a sledgehammer to knock in a nail.
I could create my own lexer and tokenizer, or ANTLR is a possibility, but I don't like re-inventing the wheel, and ANTLR sounds like overkill.
I'm not familiar with compiler techniques, so pointers in the right direction & code would be most appreciated.
Note: I can change the input format.
A lexical analyzer separates the characters of the source language into groups that logically belong together, called tokens. Each token has a token name, an abstract symbol identifying a kind of lexical unit, and an optional attribute value, the token value.
Lexical analysis is the first phase of a compiler, and the component that performs it is also known as a scanner. It converts the high-level input program into a sequence of tokens and can be implemented with a deterministic finite automaton. The output is a sequence of tokens that is sent to the parser for syntax analysis.
In computer science, lexical analysis, lexing, or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of lexical tokens (strings with an assigned and thus identified meaning).
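As a concrete illustration (the token names here are hypothetical, chosen for this grammar), a scanner might turn the line foo = bar into this stream of (token name, token value) pairs:
('identifier', 'foo')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('identifier', 'bar')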
There's an excellent article on Using Regular Expressions for Lexical Analysis at effbot.org.
Adapting the tokenizer to your problem:
import re
token_pattern = r"""
(?P<identifier>[a-zA-Z_][a-zA-Z0-9_]*)
|(?P<integer>[0-9]+)
|(?P<dot>\.)
|(?P<open_variable>[$][{])
|(?P<open_curly>[{])
|(?P<close_curly>[}])
|(?P<newline>\n)
|(?P<whitespace>\s+)
|(?P<equals>[=])
|(?P<slash>[/])
"""
token_re = re.compile(token_pattern, re.VERBOSE)
class TokenizerException(Exception): pass
def tokenize(text):
    pos = 0
    while True:
        m = token_re.match(text, pos)
        if not m: break
        pos = m.end()
        tokname = m.lastgroup
        tokvalue = m.group(tokname)
        yield tokname, tokvalue
    if pos != len(text):
        raise TokenizerException('tokenizer stopped at pos %r of %r' % (
            pos, len(text)))
To test it, we do:
stuff = r'property.${general.name}.ip = ${general.ip}'
stuff2 = r'''
general {
    name = myname
    ip = 127.0.0.1
}
'''
print(' stuff '.center(60, '='))
for tok in tokenize(stuff):
    print(tok)
print(' stuff2 '.center(60, '='))
for tok in tokenize(stuff2):
    print(tok)
which prints:
========================== stuff ===========================
('identifier', 'property')
('dot', '.')
('open_variable', '${')
('identifier', 'general')
('dot', '.')
('identifier', 'name')
('close_curly', '}')
('dot', '.')
('identifier', 'ip')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('open_variable', '${')
('identifier', 'general')
('dot', '.')
('identifier', 'ip')
('close_curly', '}')
========================== stuff2 ==========================
('newline', '\n')
('identifier', 'general')
('whitespace', ' ')
('open_curly', '{')
('newline', '\n')
('whitespace', ' ')
('identifier', 'name')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('identifier', 'myname')
('newline', '\n')
('whitespace', ' ')
('identifier', 'ip')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('integer', '127')
('dot', '.')
('integer', '0')
('dot', '.')
('integer', '0')
('dot', '.')
('integer', '1')
('newline', '\n')
('close_curly', '}')
('newline', '\n')
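The tokenizer stops short of the substitution you actually want, so here is a minimal sketch of one way to finish the job. load_definitions and its flat 'section.key' dict are my own hypothetical names; it assumes one key = value pair per line and follows the question's convention that a bare ${component1} resolves to the section name itself:
import re

def load_definitions(text):
    # Parse the block format (section { key = value }) into a flat
    # dict keyed as 'section.key'.
    defs, section = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith('{'):
            section = line[:-1].strip()
            defs[section] = section   # bare ${section} -> its own name
        elif line == '}':
            section = None
        elif section and '=' in line:
            key, _, value = line.partition('=')
            defs[section + '.' + key.strip()] = value.strip()
    return defs

defs = load_definitions(stuff2)
template = r'property.${general.name}.ip = ${general.ip}'
print(re.sub(r'\$\{([^}]+)\}', lambda m: defs[m.group(1)], template))
# prints: property.myname.ip = 127.0.0.1
Whether you drive the replacement from the token stream or from a plain regex pass like this is a matter of taste; for a grammar this small, the regex pass is usually enough.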
A simple DFA works well for this. You only need a few states:
1. Looking for ${
2. Seen ${, looking for at least one valid character forming the name
3. Seen at least one name character, looking for more name characters or }
If the properties file is order agnostic, you might want a two-pass processor to verify that each name resolves correctly.
Of course, you then need to write the substitution code, but once you have a list of all the names used, the simplest possible implementation is a find/replace on ${name} with its corresponding value.
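A minimal sketch of such a scanner, using the three states above; find_names is a hypothetical helper, and it assumes letters, digits, '_' and '.' are the valid name characters:
def find_names(text):
    # State 0: looking for ${
    # State 1: seen ${, need at least one valid name character
    # State 2: collecting name characters, looking for }
    names, state, start, i = [], 0, 0, 0
    while i < len(text):
        c = text[i]
        if state == 0:
            if c == '$' and text[i+1:i+2] == '{':
                state, start = 1, i + 2
                i += 1                    # consume the '{' too
        elif state == 1:
            state = 2 if (c.isalnum() or c in '_.') else 0
        else:
            if c == '}':
                names.append(text[start:i])
                state = 0
            elif not (c.isalnum() or c in '_.'):
                state = 0                 # not a valid reference; abandon it
        i += 1
    return names

print(find_names(r'property.${component1}.ip = ${general.ip}'))
# prints: ['component1', 'general.ip']
Feed the resulting names into the find/replace step described above to produce the substituted output.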