I know there are other tools out there to parse SQL statements, but I am rolling out my own for educational purposes. I am getting stuck with my grammar right now.. If you can spot an error real quick please let me know.
SELECT = r'SELECT'
FROM = r'FROM'
COLUMN = TABLE = r'[a-zA-Z]+'
COMMA = r','
STAR = r'\*'
END = r';'
t_ignore = ' ' #ignores spaces
statement : SELECT columns FROM TABLE END
columns : STAR
| rec_columns
rec_columns : COLUMN
| rec_columns COMMA COLUMN
When I try to parse a statement like 'SELECT a FROM b;' I get an syntax error at the FROM token... Any help is greatly appreciated!
(Edit) Code:
#!/usr/bin/python
import ply.lex as lex
import ply.yacc as yacc
tokens = (
'SELECT',
'FROM',
'WHERE',
'TABLE',
'COLUMN',
'STAR',
'COMMA',
'END',
)
t_SELECT = r'select|SELECT'
t_FROM = r'from|FROM'
t_WHERE = r'where|WHERE'
t_TABLE = r'[a-zA-Z]+'
t_COLUMN = r'[a-zA-Z]+'
t_STAR = r'\*'
t_COMMA = r','
t_END = r';'
t_ignore = ' \t'
def t_error(t):
print 'Illegal character "%s"' % t.value[0]
t.lexer.skip(1)
lex.lex()
NONE, SELECT, INSERT, DELETE, UPDATE = range(5)
states = ['NONE', 'SELECT', 'INSERT', 'DELETE', 'UPDATE']
current_state = NONE
def p_statement_expr(t):
'statement : expression'
print states[current_state], t[1]
def p_expr_select(t):
'expression : SELECT columns FROM TABLE END'
global current_state
current_state = SELECT
print t[3]
def p_recursive_columns(t):
'''recursive_columns : recursive_columns COMMA COLUMN'''
t[0] = ', '.join([t[1], t[3]])
def p_recursive_columns_base(t):
'''recursive_columns : COLUMN'''
t[0] = t[1]
def p_columns(t):
'''columns : STAR
| recursive_columns'''
t[0] = t[1]
def p_error(t):
print 'Syntax error at "%s"' % t.value if t else 'NULL'
global current_state
current_state = NONE
yacc.yacc()
while True:
try:
input = raw_input('sql> ')
except EOFError:
break
yacc.parse(input)
SQL Parsing The parsing stage involves separating the pieces of a SQL statement into a data structure that other routines can process. The database parses a statement when instructed by the application, which means that only the application, and not the database itself, can reduce the number of parses.
Simple SQL operations like LOAD, ALTER, INSERT, and UPDATE can turn parsing data from a chore into an efficient and mistake-free task.
format(first, reindent=True, keyword_case='upper')) SELECT * FROM foo; >>> # Parsing a SQL statement: >>> parsed = sqlparse. parse('select * from foo')[0] >>> parsed.
I think your problem is that your regular expressions for t_TABLE
and t_COLUMN
are also matching your reserved words (SELECT and FROM). In other words, SELECT a FROM b;
tokenizes to something like COLUMN COLUMN COLUMN COLUMN END
(or some other ambiguous tokenization) and this doesn't match any of your productions so you get a syntax error.
As a quick sanity check, change those regular expressions to match exactly what you're typing in like this:
t_TABLE = r'b'
t_COLUMN = r'a'
You will see that the syntax SELECT a FROM b;
passes because the regular expressions 'a' and 'b' don't match your reserved words.
And, there's another problem that the regular expressions for TABLE and COLUMN overlap as well, so the lexer can't tokenize without ambiguity with respect to those tokens either.
There's a subtle, but relevant section in the PLY documentation regarding this. Not sure the best way to explain this, but the trick is that the tokenization pass happens first so it can't really use context from your production rules to know whether it has come across a TABLE token or a COLUMN token. You need to generalize those into some kind of ID
token and then weed things out during the parse.
If I had some more energy I'd try to work through your code some more and provide an actual solution in code, but I think since you've already expressed that this is a learning exercise that perhaps you will be content with me pointing in the right direction.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With