I'm writing a parser in Emacs Lisp. It's a parser for text files looking like this:
rule:
int: 1, 2, 3, ...
string: and, or, then, when
text:
----------
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Pellentesque
in tellus. In pharetra consequat augue. In congue. Curabitur
pellentesque iaculis eros. Proin magna odio, posuere sed, commodo nec,
varius nec, tortor.
----------
more: ...
rule:
...
I don't really care about the key (int, string, ...). I want the value. So for the file above int has value "1, 2, 3, ...", string "and, or, then, when" and text "Lorem ..." (excluding the dashes).
I'm thinking about two different solutions, but I don't know which one to use. Should I:
1. Create a simple parser that loops through all lines, matches each line against a regex, and extracts the groups I want?
2. Write a more sophisticated parser with a separate lexer and parser?
Right now the files are quite simple, and I guess I don't need anything as advanced as the second option. But these files may get a bit more complicated, so I want to make the parser easy to extend.
How would you solve this?
Are you already familiar with recursive descent parsers? They're relatively easy to write by hand in your favourite programming language, which would include Emacs Lisp. For very simple parsing, you can often get by with looking-at and search-forward. These would also form the basis of any tokenizing routines that would be called by your recursive descent parser, or any other style of parser.
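As a rough illustration of the looking-at approach for the file format in the question, here is a minimal sketch. The function name my/parse-rules is hypothetical, and the sketch assumes that keys are lowercase words, that a "rule:" line starts a new rule, and that a multi-line value is opened and closed by a line made only of dashes. It returns one alist of (KEY . VALUE) pairs per rule.

```elisp
;;; -*- lexical-binding: t; -*-
(require 'subr-x)                       ; string-trim on older Emacs versions

;; Hypothetical helper, not part of any library.
(defun my/parse-rules (string)
  "Parse STRING into a list of rules; each rule is an alist of (KEY . VALUE)."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (let (rules current)
      (while (not (eobp))
        (cond
         ;; A "rule:" line starts a new rule; store the previous one, if any.
         ((looking-at "^rule:[ \t]*$")
          (when current (push (nreverse current) rules))
          (setq current nil)
          (forward-line 1))
         ;; A bare "key:" line followed by a dash line opens a block value.
         ((looking-at "^\\([a-z]+\\):[ \t]*\n-+[ \t]*\n")
          (let ((key (match-string 1))
                (start (match-end 0)))
            (goto-char start)
            (unless (re-search-forward "^-+[ \t]*$" nil t)
              (error "Unterminated block for key %s" key))
            ;; The value is everything between the two dash lines.
            (push (cons key (string-trim
                             (buffer-substring-no-properties
                              start (match-beginning 0))))
                  current)
            (forward-line 1)))
         ;; "key: value" on a single line.
         ((looking-at "^\\([a-z]+\\):[ \t]*\\(.*\\)$")
          (push (cons (match-string 1) (match-string 2)) current)
          (forward-line 1))
         ;; Anything else (for example a blank line) is skipped.
         (t (forward-line 1))))
      (when current (push (nreverse current) rules))
      (nreverse rules))))
```

For example, (my/parse-rules "rule:\nint: 1, 2, 3\ntext:\n-----\nLorem\n-----\n") should return ((("int" . "1, 2, 3") ("text" . "Lorem"))). Adding a new kind of line means adding one more cond branch, which keeps this style workable as long as the format stays line-oriented.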
[11 Feb 2009] I added an example recursive descent parser in Emacs Lisp below. It parses simple arithmetic expressions including addition, subtraction, multiplication, division, exponentiation, and parenthesized sub-expressions. Right now, it assumes all tokens are in the global variable *token*, but if you modify gettok and peektok as necessary you can have them walk through a buffer. To use it as is, just try out the following:
(setq *token* '( 3 ^ 5 ^ 7 + 5 * 3 + 7 / 11))
(rdh/expr)
=> (+ (+ (^ 3 (^ 5 7)) (* 5 3)) (/ 7 11))
The parsing code follows.
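;; Token list consumed by the parser. Numbers are Lisp numbers, operators
;; are symbols, and parentheses are the strings "(" and ")".
(defvar *token* nil)

(defun gettok ()
  "Remove and return the next token, or nil when none are left."
  (and *token* (pop *token*)))

(defun peektok ()
  "Return the next token without consuming it, or nil when none are left."
  (and *token* (car *token*)))

;; expr := factor followed by any number of (+ or -) factor; left-associative
(defun rdh/expr ()
  (rdh/expr-tail (rdh/factor)))

(defun rdh/expr-tail (expr)
  (let ((tok (peektok)))
    (cond ((or (null tok)
               (equal tok ")"))
           expr)
          ((member tok '(+ -))
           (gettok)
           (let ((fac (rdh/factor)))
             (rdh/expr-tail (list tok expr fac))))
          (t (error "bad expr")))))

;; factor := term followed by any number of (* or /) term; left-associative
(defun rdh/factor ()
  (rdh/factor-tail (rdh/term)))

(defun rdh/factor-tail (fac)
  (let ((tok (peektok)))
    (cond ((or (null tok)
               (member tok '(")" + -)))
           fac)
          ((member tok '(* /))
           (gettok)
           (let ((term (rdh/term)))
             (rdh/factor-tail (list tok fac term))))
          (t (error "bad factor")))))

;; term := prim, optionally followed by ^ term; right-associative
(defun rdh/term ()
  (let* ((prim (rdh/prim))
         (tok (peektok)))
    (cond ((or (null tok)
               (member tok '(")" + - / *)))
           prim)
          ((equal tok '^)
           (gettok)
           (list tok prim (rdh/term)))
          (t (error "bad term")))))

;; prim := a number, or an expr wrapped in "(" and ")"
(defun rdh/prim ()
  (let ((tok (gettok)))
    (cond ((numberp tok) tok)
          ((equal tok "(")
           (let* ((expr (rdh/expr))
                  (tok (peektok)))
             (if (not (equal tok ")"))
                 (error "bad parenthesized expr")
               (gettok)
               expr)))
          (t (error "bad prim")))))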
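The parser above expects a ready-made token list, so something has to produce one. The following is a small sketch of a tokenizer that does so from a string. The name my/tokenize is hypothetical, and the sketch assumes unsigned integer or decimal numbers only. It emits tokens in the shape the parser tests for: numbers as Lisp numbers, operators as symbols, and parentheses as strings.

```elisp
;;; -*- lexical-binding: t; -*-

;; Hypothetical helper, not part of any library.
(defun my/tokenize (string)
  "Split STRING into a list of tokens for the expression parser."
  (let ((pos 0)
        (tokens '()))
    (while (< pos (length string))
      (cond
       ;; Skip whitespace.  `string-match' returns the index of the first
       ;; match at or after POS, so comparing it with POS anchors the match.
       ((eq (string-match "[ \t\n]+" string pos) pos)
        (setq pos (match-end 0)))
       ;; Unsigned integer or decimal number.
       ((eq (string-match "[0-9]+\\(?:\\.[0-9]+\\)?" string pos) pos)
        (push (string-to-number (match-string 0 string)) tokens)
        (setq pos (match-end 0)))
       ;; Operators become symbols such as + or ^.
       ((eq (string-match "[-+*/^]" string pos) pos)
        (push (intern (match-string 0 string)) tokens)
        (setq pos (match-end 0)))
       ;; Parentheses stay strings, because the parser compares them
       ;; against the strings "(" and ")".
       ((eq (string-match "[()]" string pos) pos)
        (push (match-string 0 string) tokens)
        (setq pos (match-end 0)))
       (t (error "Unexpected character at position %d" pos))))
    (nreverse tokens)))
```

With this in place, (setq *token* (my/tokenize "(3 + 5) * 2")) followed by (rdh/expr) should yield (* (+ 3 5) 2). To read from a buffer instead, the same regexps could be applied with looking-at at point, advancing point to (match-end 0) after each token.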
For parsing tasks, also take a look at the Semantic library from the CEDET project.