I'm looking to speed up my discovery process here quite a bit, as this is my first venture into the world of lexical analysis. Maybe this is even the wrong path. First, I'll describe my problem:
I've got very large properties files (on the order of 1,000 properties) which, when distilled, come down to about 15 important properties; the rest can be generated or rarely ever change.
So, for example:
general {
    name = myname
    ip = 127.0.0.1
}
component1 {
    key = value
    foo = bar
}
This is the type of format I want to create, in order to tokenize something like:
property.${general.name}blah.home.directory = /blah
property.${general.name}.ip = ${general.ip}
property.${component1}.ip = ${general.ip}
property.${component1}.foo = ${component1.foo}
into
property.mynameblah.home.directory = /blah
property.myname.ip = 127.0.0.1
property.component1.ip = 127.0.0.1
property.component1.foo = bar
Lexical analysis and tokenization sound like my best route, but this is a very simple form of it. It's a simple grammar, a simple substitution, and I'd like to make sure that I'm not bringing a sledgehammer to knock in a nail.
I could create my own lexer and tokenizer, or ANTLR is a possibility, but I don't like re-inventing the wheel, and ANTLR sounds like overkill.
I'm not familiar with compiler techniques, so pointers in the right direction & code would be most appreciated.
Note: I can change the input format.
A lexical analyzer separates the characters of the source language into groups that logically belong together, called tokens. Each token has a token name, an abstract symbol identifying a kind of lexical unit, and an optional attribute value, the token value.
Lexical analysis is the first phase of a compiler, and the component that performs it is also known as a scanner. It converts the high-level input program into a sequence of tokens and can be implemented with a deterministic finite automaton. The output is a sequence of tokens that is sent to the parser for syntax analysis.
In computer science, lexical analysis, lexing, or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of lexical tokens (strings with an assigned and thus identified meaning).
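As a concrete illustration (the token names here are hypothetical, chosen for this grammar), a scanner might turn the line foo = bar into this stream of (token name, token value) pairs:
('identifier', 'foo')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('identifier', 'bar')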
There's an excellent article on Using Regular Expressions for Lexical Analysis at effbot.org.
Adapting the tokenizer to your problem:
import re
token_pattern = r"""
(?P<identifier>[a-zA-Z_][a-zA-Z0-9_]*)
|(?P<integer>[0-9]+)
|(?P<dot>\.)
|(?P<open_variable>[$][{])
|(?P<open_curly>[{])
|(?P<close_curly>[}])
|(?P<newline>\n)
|(?P<whitespace>\s+)
|(?P<equals>[=])
|(?P<slash>[/])
"""
token_re = re.compile(token_pattern, re.VERBOSE)
class TokenizerException(Exception): pass
def tokenize(text):
    pos = 0
    while True:
        m = token_re.match(text, pos)
        if not m: break
        pos = m.end()
        tokname = m.lastgroup
        tokvalue = m.group(tokname)
        yield tokname, tokvalue
    if pos != len(text):
        raise TokenizerException('tokenizer stopped at pos %r of %r' % (
            pos, len(text)))
To test it, we do:
stuff = r'property.${general.name}.ip = ${general.ip}'
stuff2 = r'''
general {
    name = myname
    ip = 127.0.0.1
}
'''
print(' stuff '.center(60, '='))
for tok in tokenize(stuff):
    print(tok)
print(' stuff2 '.center(60, '='))
for tok in tokenize(stuff2):
    print(tok)
which prints:
========================== stuff ===========================
('identifier', 'property')
('dot', '.')
('open_variable', '${')
('identifier', 'general')
('dot', '.')
('identifier', 'name')
('close_curly', '}')
('dot', '.')
('identifier', 'ip')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('open_variable', '${')
('identifier', 'general')
('dot', '.')
('identifier', 'ip')
('close_curly', '}')
========================== stuff2 ==========================
('newline', '\n')
('identifier', 'general')
('whitespace', ' ')
('open_curly', '{')
('newline', '\n')
('whitespace', ' ')
('identifier', 'name')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('identifier', 'myname')
('newline', '\n')
('whitespace', ' ')
('identifier', 'ip')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('integer', '127')
('dot', '.')
('integer', '0')
('dot', '.')
('integer', '0')
('dot', '.')
('integer', '1')
('newline', '\n')
('close_curly', '}')
('newline', '\n')
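The tokenizer stops short of the substitution you actually want, so here is a minimal sketch of one way to finish the job. load_definitions and its flat 'section.key' dict are my own hypothetical names; it assumes one key = value pair per line and follows the question's convention that a bare ${component1} resolves to the section name itself:
import re

def load_definitions(text):
    # Parse the block format (section { key = value }) into a flat
    # dict keyed as 'section.key'.
    defs, section = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith('{'):
            section = line[:-1].strip()
            defs[section] = section   # bare ${section} -> its own name
        elif line == '}':
            section = None
        elif section and '=' in line:
            key, _, value = line.partition('=')
            defs[section + '.' + key.strip()] = value.strip()
    return defs

defs = load_definitions(stuff2)
template = r'property.${general.name}.ip = ${general.ip}'
print(re.sub(r'\$\{([^}]+)\}', lambda m: defs[m.group(1)], template))
# prints: property.myname.ip = 127.0.0.1
Whether you drive the replacement from the token stream or from a plain regex pass like this is a matter of taste; for a grammar this small, the regex pass is usually enough.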
A simple DFA works well for this. You only need a few states:
1. Looking for ${
2. Seen ${, looking for at least one valid character forming the name
3. Seen at least one name character, looking for more name characters or }
If the properties file is order agnostic, you might want a two-pass processor to verify that each name resolves correctly.
Of course, you then need to write the substitution code, but once you have a list of all the names used, the simplest possible implementation is a find/replace on ${name} with its corresponding value.
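A minimal sketch of such a scanner, using the three states above; find_names is a hypothetical helper, and it assumes letters, digits, '_' and '.' are the valid name characters:
def find_names(text):
    # State 0: looking for ${
    # State 1: seen ${, need at least one valid name character
    # State 2: collecting name characters, looking for }
    names, state, start, i = [], 0, 0, 0
    while i < len(text):
        c = text[i]
        if state == 0:
            if c == '$' and text[i+1:i+2] == '{':
                state, start = 1, i + 2
                i += 1                    # consume the '{' too
        elif state == 1:
            state = 2 if (c.isalnum() or c in '_.') else 0
        else:
            if c == '}':
                names.append(text[start:i])
                state = 0
            elif not (c.isalnum() or c in '_.'):
                state = 0                 # not a valid reference; abandon it
        i += 1
    return names

print(find_names(r'property.${component1}.ip = ${general.ip}'))
# prints: ['component1', 'general.ip']
Feed the resulting names into the find/replace step described above to produce the substituted output.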