Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I remove comments from an Oracle SQL script using pyparsing in Python 3.x?

I have an oracle sql script containing multiple statements which are separated by ";". I'm trying to remove all comments within this entire script. This includes:

  1. single line comments -- till a newline.
  2. block comments /* ... */
  3. Additionally, I would like to be able to split the entire script into its statements.

I know of the library called sqlparse that can format to remove code, but I came across this edge case:

edge_case = """myval := oneval || otherval ||--;
-- single comment line
lpad(nvl(myval, ' '), 10, ' ');
"""

formatted = sqlparse.format(edge_case, strip_comments=True)
sqlparse.split(formatted)

# returns ['myval := oneval || otherval ||--;', "lpad(nvl(myval, ' '), 10, ' ');"]
# it splits into 2 statements when it should only be 1!

My idea is to focus on each commenting case, and then try to parse the rest of non-commented code by just string tokens which consists of all valid characters in sql.

from pyparsing import *

# define basic elements
semicolon = Literal(";")
# all valid chars excluding semicolon
all_valid_chars = alphanums + "_-+*/.,:()[]{}<>=&|"

# define an sql token
sql_token = Word(all_valid_chars)


# need to make sure that "--" is a special case
single_line_comment_token = Literal("--")
single_line_comment = single_line_comment_token + SkipTo(line_end) + line_end
test1_single_line_comment = """
-- single comment line
--- also a comment
---- also a comment
-- comment -- continues
-- comment /* also continues */

# - not a comment
"""
single_line_comment.run_tests(test1_single_line_comment)


# also handle multi line comments
multi_line_comment_open = Literal("/*")
multi_line_comment_close = Literal("*/")
multi_line_comment = multi_line_comment_open + SkipTo(multi_line_comment_close) + multi_line_comment_close

test1_multi_line_comment = """
/* multi line comment */
/* outer comment /* inner comment */

/* /* /* still valid */
/* -- still valid */
"""
multi_line_comment.run_tests(test1_multi_line_comment)

test2_multi_line_comment = """
/* multi line comment /* with nested comment should fail! */ */
"""
multi_line_comment.run_tests(test2_multi_line_comment, failure_tests=True)

I'm now stuck here and not sure where to continue.

statement2 = OneOrMore(single_line_comment | multi_line_comment | sql_token) + semicolon

# run tests
print(statement2.parseString("myval := oneval || otherval ||--;"))

My issues:

  1. The tests for each of these cases work, but I dont know how I should be testing the together to cover overlaps eg. -- /* this should be a single line comment etc.
  2. I don't know how to handle the rest of the non commented code properly. For example, all valid characters should include -, but this should mess up my single line code? Less technical issues faced:
  3. Honestly not sure where to even begin or how to structure my code since these cases can overlap. I'm not even sure if pyparsing is what I should be doing to solve this problem, but based on questions i've seen trying to remove comments from sql, parsing is the only real robust solution.
  4. run_tests does not work with multiple lines. This makes it hard to test the block comment although i can use parse_string.
like image 989
Chowder Avatar asked Jan 16 '26 18:01

Chowder


1 Answers

For stripping comments (or any pyparsing expression), you should use transform_string, not parse_string. transform_string scans through the input string and applies parse actions and suppressions. Here is some code to strip out python comments from a script:

import pyparsing as pp

comment = "#" + pp.rest_of_line
# could also use the provided `python_style_comment`
# comment = pp.python_style_comment

python_source = """
def hello(s):
    # say a nice hello!
    print(f"Hi, {s}!")
    # add a hashtag for social media
    print("#hello")
"""

# suppress comments and transform the string
print(comment.suppress().transform_string(python_source))

giving:

def hello(s):
    
    print(f"Hi, {s}!")
    
    print("

Oops, this does not detect the fact that #hello is inside a quoted string.

To fix this, we also insert a parser for quoted strings, but these we don't suppress:

# parse quoted strings first, in case they contain a comment
transform_expr = pp.quoted_string | comment.suppress()
print(transform_expr.transform_string(python_source))

Now giving:

def hello(s):
    
    print(f"Hi, {s}!")
    
    print("#hello")

For your SQL comment handling, you'll do much the same:

sql_single_line_comment = '--' + pp.rest_of_line

# use pyparsing's definition for C-style /* ... */ multiline comments
sql_multi_line_comment = pp.c_style_comment

comment_remover = (
    # parse quoted strings separately so that they don't get suppressed
    pp.quoted_string
    | (sql_single_line_comment | sql_multi_line_comment).suppress()
)

sql = "SELECT * FROM STUDENTS; -- watch out for Bobby Tables!"
print(comment_remover.transform_string(sql))

prints:

SELECT * FROM STUDENTS;

I'm glad to see you are using run_tests! If you want to define tests that span multiple lines, than instead of using the multiline string form, pass the test as a list of strings:

expr.run_tests([test_str1, test_str2, test_str3])

Unfortunately, I don't have a version of run_tests that calls transform_string.

Splitting the lines up into separate statements is best done in a second pass, after first stripping out the comments. You might try something like:

semi = pp.Literal(";")
semi.add_parse_action(pp.replace_with(";\n\n")

And then use the same transformation pattern as was used to strip comments (but this time, using the parse action instead of suppressing). Or use scan_string to locate the ";" terminators, and then write out the SQL that lies between semicolons out to their separate files (left as an exercise for the reader).

like image 145
PaulMcG Avatar answered Jan 19 '26 20:01

PaulMcG



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!