
Using PLY to parse SQL statements

I know there are other tools out there to parse SQL statements, but I am rolling my own for educational purposes. I am stuck on my grammar right now. If you can spot the error, please let me know.

SELECT = r'SELECT'
FROM = r'FROM'
COLUMN = TABLE = r'[a-zA-Z]+'
COMMA = r','
STAR = r'\*'
END = r';'
t_ignore = ' ' #ignores spaces

statement : SELECT columns FROM TABLE END

columns : STAR
        | rec_columns

rec_columns : COLUMN
            | rec_columns COMMA COLUMN

When I try to parse a statement like 'SELECT a FROM b;' I get a syntax error at the FROM token. Any help is greatly appreciated!

(Edit) Code:

#!/usr/bin/python
import ply.lex as lex
import ply.yacc as yacc

tokens = (
    'SELECT',
    'FROM',
    'WHERE',
    'TABLE',
    'COLUMN',
    'STAR',
    'COMMA',
    'END',
)

t_SELECT    = r'select|SELECT'
t_FROM      = r'from|FROM'
t_WHERE     = r'where|WHERE'
t_TABLE     = r'[a-zA-Z]+'
t_COLUMN    = r'[a-zA-Z]+'
t_STAR      = r'\*'
t_COMMA     = r','
t_END       = r';'

t_ignore    = ' \t'

def t_error(t):
    print 'Illegal character "%s"' % t.value[0]
    t.lexer.skip(1)

lex.lex()

NONE, SELECT, INSERT, DELETE, UPDATE = range(5)
states = ['NONE', 'SELECT', 'INSERT', 'DELETE', 'UPDATE']
current_state = NONE

def p_statement_expr(t):
    'statement : expression'
    print states[current_state], t[1]

def p_expr_select(t):
    'expression : SELECT columns FROM TABLE END'
    global current_state
    current_state = SELECT
    print t[3]


def p_recursive_columns(t):
    '''recursive_columns : recursive_columns COMMA COLUMN'''
    t[0] = ', '.join([t[1], t[3]])

def p_recursive_columns_base(t):
    '''recursive_columns : COLUMN'''
    t[0] = t[1]

def p_columns(t):
    '''columns : STAR
               | recursive_columns''' 
    t[0] = t[1]

def p_error(t):
    print 'Syntax error at "%s"' % t.value if t else 'NULL'
    global current_state
    current_state = NONE

yacc.yacc()


while True:
    try:
        input = raw_input('sql> ')
    except EOFError:
        break
    yacc.parse(input)

asked Sep 08 '11 22:09 by sampwing



1 Answer

I think your problem is that your regular expressions for t_TABLE and t_COLUMN are also matching your reserved words (SELECT and FROM). In other words, SELECT a FROM b; tokenizes to something like COLUMN COLUMN COLUMN COLUMN END (or some other ambiguous tokenization) and this doesn't match any of your productions so you get a syntax error.
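
To make that concrete, here is a minimal sketch that uses only Python's re module rather than PLY itself. It assumes a lexer that joins all the rules into one big alternation and tries them in the order listed, with the identifier rules first. RULES, MASTER and tokenize are names made up for this illustration, and the order PLY actually uses depends on how the rules are defined, so treat this as a model of the problem and not as PLY's exact behavior.

```python
import re

# Hypothetical rule table: the two identifier rules come first (assumed order).
RULES = [
    ('COLUMN', r'[a-zA-Z]+'),
    ('TABLE',  r'[a-zA-Z]+'),
    ('SELECT', r'select|SELECT'),
    ('FROM',   r'from|FROM'),
    ('END',    r';'),
]

# One master pattern of named groups; the first alternative that matches wins.
MASTER = re.compile('|'.join('(?P<%s>%s)' % rule for rule in RULES))

def tokenize(text):
    """Return the list of token names for text, skipping spaces and tabs."""
    names = []
    pos = 0
    while pos < len(text):
        if text[pos] in ' \t':
            pos += 1
            continue
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError('illegal character %r' % text[pos])
        names.append(match.lastgroup)  # name of the group that matched
        pos = match.end()
    return names

print(tokenize('SELECT a FROM b;'))
```

With this rule order every word, including SELECT and FROM, comes out as COLUMN, and TABLE can never be produced at all because COLUMN has the same pattern and is tried first.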

As a quick sanity check, change those regular expressions to match exactly what you're typing in like this:

t_TABLE = r'b'
t_COLUMN = r'a'

You will see that the syntax SELECT a FROM b; passes because the regular expressions 'a' and 'b' don't match your reserved words.

There's a second problem: the regular expressions for TABLE and COLUMN are identical, so the lexer has no way to tell those two tokens apart either.

There's a subtle but relevant section in the PLY documentation about this. I'm not sure of the best way to explain it, but the trick is that the tokenization pass happens first, so the lexer can't use context from your production rules to know whether it has come across a TABLE token or a COLUMN token. You need to generalize those into some kind of ID token and then sort things out during the parse.
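
A rough sketch of that idea, assuming PLY is installed: a single ID rule matches every word, and a lookup table reclassifies the reserved words. The names reserved, ID, column_list and parse_sql are my own choices for illustration, not something from the original code, and the tuple returned by the parser is just one possible result shape. WHERE is left out because the grammar in the question never uses it.

```python
import ply.lex as lex
import ply.yacc as yacc

# Reserved words, keyed by lowercase text (hypothetical table for this sketch).
reserved = {'select': 'SELECT', 'from': 'FROM'}

tokens = ['ID', 'STAR', 'COMMA', 'END'] + list(reserved.values())

t_STAR = r'\*'
t_COMMA = r','
t_END = r';'
t_ignore = ' \t'

def t_ID(t):
    r'[a-zA-Z]+'
    # Every word matches here; keywords are reclassified by the lookup.
    t.type = reserved.get(t.value.lower(), 'ID')
    return t

def t_error(t):
    print('Illegal character %r' % t.value[0])
    t.lexer.skip(1)

def p_statement(p):
    'statement : SELECT columns FROM ID END'
    # The grammar position, not the lexer, decides that this ID is a table.
    p[0] = ('select', p[2], p[4])

def p_columns_star(p):
    'columns : STAR'
    p[0] = ['*']

def p_columns_list(p):
    'columns : column_list'
    p[0] = p[1]

def p_column_list_base(p):
    'column_list : ID'
    p[0] = [p[1]]

def p_column_list_rec(p):
    'column_list : column_list COMMA ID'
    p[0] = p[1] + [p[3]]

def p_error(p):
    print('Syntax error at %r' % (p.value if p else 'end of input'))

lexer = lex.lex()
parser = yacc.yacc(debug=False)  # PLY 3.x may also cache tables in parsetab.py

def parse_sql(text):
    """Parse one statement and return ('select', columns, table)."""
    return parser.parse(text, lexer=lexer)

print(parse_sql('SELECT a FROM b;'))
```

Because the lexer only ever emits ID for a plain word, the question of whether it names a column or a table is answered by where it appears in the statement production.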

If I had more energy I'd work through your code further and provide an actual solution in code, but since you've said this is a learning exercise, perhaps you'll be content with me pointing you in the right direction.

answered Sep 20 '22 02:09 by Joe Holloway