Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use regex to extract text in order?

Tags:

python

regex

I am working on a python flow script. Which basically parses SQL statements and get the target table and the source tables. Example

CREATE TABLE TAR_TAB1 AS
SELECT * FROM SRC_TAB1 JOIN SRC_TAB2

CREATE TABLE TAR_TAB2 AS 
SELECT * FROM SRC_TAB3 JOIN SRC_TAB4

INSERT INTO TAR_TAB3 AS
SELECT * FROM SRC_TAB5 JOIN SRC_TAB6

I wrote regex

target = re.findall(r'\w+(?=\s+AS)', data)
source = re.findall(r'(?:FROM|JOIN)\s*([^\s]+), data)

the results are as expected,

target list contains, TAR_TAB1, TAR_TAB2, TAR_TAB3
source list contains, SRC_TAB1, SRC_TAB2, SRC_TAB3, SRC_TAB4, ETC

HOW TO GET OUTPUT LIKE EITHER LIST OR DATA FRAME LIKE, preferably dataframe, as such the flow of extraction is not disturbed.

target         source
TAR_TAB1       SRC_TAB1
TAR_TAB1       SRC_TAB2
TAR_TAB2       SRC_TAB3
TAR_TAB2       SRC_TAB4
TAR_TAB3       SRC_TAB5
TAR_TAB4       SRC_TAB6
like image 580
Sai Astro Avatar asked May 26 '20 04:05

Sai Astro


People also ask

What does '$' mean in regex?

$ means "Match the end of the string" (the position after the last character in the string). Both are called anchors and ensure that the entire string is matched instead of just a substring.

What does regex 0 * 1 * 0 * 1 * Mean?

Basically (0+1)* mathes any sequence of ones and zeroes. So, in your example (0+1)*1(0+1)* should match any sequence that has 1. It would not match 000 , but it would match 010 , 1 , 111 etc. (0+1) means 0 OR 1. 1* means any number of ones.

Does order matter in regex?

The order of the characters inside a character class does not matter. The results are identical. You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9.

How do you extract part of a string?

The substr() method extracts a part of a string. The substr() method begins at a specified position, and returns a specified number of characters. The substr() method does not change the original string. To extract characters from the end of the string, use a negative start position.


1 Answers

Here's a solution:

targets = re.findall(r'(?:CREATE\s+TABLE|INSERT\s+INTO)\s+([a-z0-9A-Z_]+)\s+AS', text)
sources = re.findall(r'SELECT\s+\*\s+FROM\s([a-z0-9A-Z_]+)\s+JOIN\s+([a-z0-9A-Z_]+)', text)

# each target has multiple sources, so repeat each target n times per 
# number of sources.
lens = [len(src) for src in sources]
targets = np.repeat(targets, lens) 

# 'flatten' the list of sources from [(t1, t2), (t3, t4)] to 
# [t1, t2, t3, t4]
sources = [tab for exp in sources for tab in exp]

pd.DataFrame({"src": sources, "tgt": targets})

Results:

        src       tgt
0  SRC_TAB1  TAR_TAB1
1  SRC_TAB2  TAR_TAB1
2  SRC_TAB3  TAR_TAB2
3  SRC_TAB4  TAR_TAB2
4  SRC_TAB5  TAR_TAB3
5  SRC_TAB6  TAR_TAB3
like image 129
Roy2012 Avatar answered Sep 19 '22 13:09

Roy2012