I have an oracle sql script containing multiple statements which are separated by ";". I'm trying to remove all comments within this entire script. This includes:
-- till a newline.I know of the library called sqlparse that can format to remove code, but I came across this edge case:
edge_case = """myval := oneval || otherval ||--;
-- single comment line
lpad(nvl(myval, ' '), 10, ' ');
"""
formatted = sqlparse.format(edge_case, strip_comments=True)
sqlparse.split(formatted)
# returns ['myval := oneval || otherval ||--;', "lpad(nvl(myval, ' '), 10, ' ');"]
# it splits into 2 statements when it should only be 1!
My idea is to focus on each commenting case, and then try to parse the rest of non-commented code by just string tokens which consists of all valid characters in sql.
from pyparsing import *
# define basic elements
semicolon = Literal(";")
# all valid chars excluding semicolon
all_valid_chars = alphanums + "_-+*/.,:()[]{}<>=&|"
# define an sql token
sql_token = Word(all_valid_chars)
# need to make sure that "--" is a special case
single_line_comment_token = Literal("--")
single_line_comment = single_line_comment_token + SkipTo(line_end) + line_end
test1_single_line_comment = """
-- single comment line
--- also a comment
---- also a comment
-- comment -- continues
-- comment /* also continues */
# - not a comment
"""
single_line_comment.run_tests(test1_single_line_comment)
# also handle multi line comments
multi_line_comment_open = Literal("/*")
multi_line_comment_close = Literal("*/")
multi_line_comment = multi_line_comment_open + SkipTo(multi_line_comment_close) + multi_line_comment_close
test1_multi_line_comment = """
/* multi line comment */
/* outer comment /* inner comment */
/* /* /* still valid */
/* -- still valid */
"""
multi_line_comment.run_tests(test1_multi_line_comment)
test2_multi_line_comment = """
/* multi line comment /* with nested comment should fail! */ */
"""
multi_line_comment.run_tests(test2_multi_line_comment, failure_tests=True)
I'm now stuck here and not sure where to continue.
statement2 = OneOrMore(single_line_comment | multi_line_comment | sql_token) + semicolon
# run tests
print(statement2.parseString("myval := oneval || otherval ||--;"))
My issues:
-- /* this should be a single line comment etc.-, but this should mess up my single line code?
Less technical issues faced:run_tests does not work with multiple lines. This makes it hard to test the block comment although i can use parse_string.For stripping comments (or any pyparsing expression), you should use transform_string, not parse_string. transform_string scans through the input string and applies parse actions and suppressions. Here is some code to strip out python comments from a script:
import pyparsing as pp
comment = "#" + pp.rest_of_line
# could also use the provided `python_style_comment`
# comment = pp.python_style_comment
python_source = """
def hello(s):
# say a nice hello!
print(f"Hi, {s}!")
# add a hashtag for social media
print("#hello")
"""
# suppress comments and transform the string
print(comment.suppress().transform_string(python_source))
giving:
def hello(s):
print(f"Hi, {s}!")
print("
Oops, this does not detect the fact that #hello is inside a quoted string.
To fix this, we also insert a parser for quoted strings, but these we don't suppress:
# parse quoted strings first, in case they contain a comment
transform_expr = pp.quoted_string | comment.suppress()
print(transform_expr.transform_string(python_source))
Now giving:
def hello(s):
print(f"Hi, {s}!")
print("#hello")
For your SQL comment handling, you'll do much the same:
sql_single_line_comment = '--' + pp.rest_of_line
# use pyparsing's definition for C-style /* ... */ multiline comments
sql_multi_line_comment = pp.c_style_comment
comment_remover = (
# parse quoted strings separately so that they don't get suppressed
pp.quoted_string
| (sql_single_line_comment | sql_multi_line_comment).suppress()
)
sql = "SELECT * FROM STUDENTS; -- watch out for Bobby Tables!"
print(comment_remover.transform_string(sql))
prints:
SELECT * FROM STUDENTS;
I'm glad to see you are using run_tests! If you want to define tests that span multiple lines, than instead of using the multiline string form, pass the test as a list of strings:
expr.run_tests([test_str1, test_str2, test_str3])
Unfortunately, I don't have a version of run_tests that calls transform_string.
Splitting the lines up into separate statements is best done in a second pass, after first stripping out the comments. You might try something like:
semi = pp.Literal(";")
semi.add_parse_action(pp.replace_with(";\n\n")
And then use the same transformation pattern as was used to strip comments (but this time, using the parse action instead of suppressing). Or use scan_string to locate the ";" terminators, and then write out the SQL that lies between semicolons out to their separate files (left as an exercise for the reader).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With