I am working on a python flow script. Which basically parses SQL statements and get the target table and the source tables. Example
CREATE TABLE TAR_TAB1 AS
SELECT * FROM SRC_TAB1 JOIN SRC_TAB2
CREATE TABLE TAR_TAB2 AS
SELECT * FROM SRC_TAB3 JOIN SRC_TAB4
INSERT INTO TAR_TAB3 AS
SELECT * FROM SRC_TAB5 JOIN SRC_TAB6
I wrote regex
target = re.findall(r'\w+(?=\s+AS)', data)
source = re.findall(r'(?:FROM|JOIN)\s*([^\s]+), data)
the results are as expected,
target list contains, TAR_TAB1, TAR_TAB2, TAR_TAB3
source list contains, SRC_TAB1, SRC_TAB2, SRC_TAB3, SRC_TAB4, ETC
HOW TO GET OUTPUT LIKE EITHER LIST OR DATA FRAME LIKE, preferably dataframe, as such the flow of extraction is not disturbed.
target source
TAR_TAB1 SRC_TAB1
TAR_TAB1 SRC_TAB2
TAR_TAB2 SRC_TAB3
TAR_TAB2 SRC_TAB4
TAR_TAB3 SRC_TAB5
TAR_TAB4 SRC_TAB6
$ means "Match the end of the string" (the position after the last character in the string). Both are called anchors and ensure that the entire string is matched instead of just a substring.
Basically (0+1)* mathes any sequence of ones and zeroes. So, in your example (0+1)*1(0+1)* should match any sequence that has 1. It would not match 000 , but it would match 010 , 1 , 111 etc. (0+1) means 0 OR 1. 1* means any number of ones.
The order of the characters inside a character class does not matter. The results are identical. You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9.
The substr() method extracts a part of a string. The substr() method begins at a specified position, and returns a specified number of characters. The substr() method does not change the original string. To extract characters from the end of the string, use a negative start position.
Here's a solution:
targets = re.findall(r'(?:CREATE\s+TABLE|INSERT\s+INTO)\s+([a-z0-9A-Z_]+)\s+AS', text)
sources = re.findall(r'SELECT\s+\*\s+FROM\s([a-z0-9A-Z_]+)\s+JOIN\s+([a-z0-9A-Z_]+)', text)
# each target has multiple sources, so repeat each target n times per
# number of sources.
lens = [len(src) for src in sources]
targets = np.repeat(targets, lens)
# 'flatten' the list of sources from [(t1, t2), (t3, t4)] to
# [t1, t2, t3, t4]
sources = [tab for exp in sources for tab in exp]
pd.DataFrame({"src": sources, "tgt": targets})
Results:
src tgt
0 SRC_TAB1 TAR_TAB1
1 SRC_TAB2 TAR_TAB1
2 SRC_TAB3 TAR_TAB2
3 SRC_TAB4 TAR_TAB2
4 SRC_TAB5 TAR_TAB3
5 SRC_TAB6 TAR_TAB3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With