Pythonic way to implement a tokenizer

Question

I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice?

I've implemented a tokenizer before in C and in Java so I'm fine with the theory, I'd just like to ensure I'm following pythonic styles and best practices.

Listing Token Types:

In Java, for example, I would have a list of fields like so:

public static final int TOKEN_INTEGER = 0

But, obviously, there's no way (I think) to declare a constant variable in Python, so I could just replace this with normal variable declarations but that doesn't strike me as a great solution since the declarations could be altered.

Returning Tokens From The Tokenizer:

Is there a better alternative to just simply returning a list of tuples e.g.

[ (TOKEN_INTEGER, 17), (TOKEN_STRING, "Sixteen")]?

Cheers,

Pete

AKX · Accepted Answer

There's an undocumented class in the re module called re.Scanner. It's very straightforward to use for a tokenizer:

import re
scanner=re.Scanner([
  (r"[0-9]+",       lambda scanner,token:("INTEGER", token)),
  (r"[a-z_]+",      lambda scanner,token:("IDENTIFIER", token)),
  (r"[,.]+",        lambda scanner,token:("PUNCTUATION", token)),
  (r"\s+", None), # None == skip token.
])

results, remainder=scanner.scan("45 pigeons, 23 cows, 11 spiders.")
print results

will result in

[('INTEGER', '45'),
 ('IDENTIFIER', 'pigeons'),
 ('PUNCTUATION', ','),
 ('INTEGER', '23'),
 ('IDENTIFIER', 'cows'),
 ('PUNCTUATION', ','),
 ('INTEGER', '11'),
 ('IDENTIFIER', 'spiders'),
 ('PUNCTUATION', '.')]

I used re.Scanner to write a pretty nifty configuration/structured data format parser in only a couple hundred lines.

RossFabricant · Answer

Python takes a "we're all consenting adults" approach to information hiding. It's OK to use variables as though they were constants, and trust that users of your code won't do something stupid.

Pythonic way to implement a tokenizer

Tags:

python

coding-style

tokenize

Peter

2 Answers

AKX

RossFabricant

Recent Activity

Donate For Us

Pythonic way to implement a tokenizer

Tags:

python

coding-style

tokenize

Peter

2 Answers

AKX

RossFabricant

Related questions

Recent Activity

Donate For Us