Find the target word and the word before it in col_a, and append the matched string to the col_b_PY and col_c_LG columns
I have tried the code below to achieve this, but I am not able to get the expected output. Any help is appreciated.
Here is my approach using regular expressions:
df['col_b_PY']=df.col_a.str.contains(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}PY")
df.col_a.str.extract(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}PY",expand=True)
The DataFrame looks like this:
col_a
Python PY is a general-purpose language LG
Programming language LG in Python PY
Its easier LG to understand PY
The syntax of the language LG is clean PY
Desired output:
col_a                                       col_b_PY       col_c_LG
Python PY is a general-purpose language LG  Python PY      language LG
Programming language LG in Python PY        Python PY      language LG
Its easier LG to understand PY              understand PY  easier LG
The syntax of the language LG is clean PY   clean PY       language LG
You may use
df['col_b_PY'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+PY)\b")
df['col_c_LG'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+LG)\b")
Or, to extract all matches and join them with a space:
df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
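To make the extractall chain above easier to follow, here is a minimal sketch that splits it into its intermediate steps. The sample rows and the variable names (pattern, matches, wide, joined) are only for illustration; the third row is an assumed edge case with no PY tag at all.

```python
import pandas as pd

df = pd.DataFrame({
    "col_a": [
        "Python PY is a general purpose PY language LG",
        "Its easier LG to understand PY",
        "nothing tagged here",  # illustrative row without any match
    ]
})
pattern = r"([a-zA-Z'-]+\s+PY)\b"

# One row per match, indexed by (original row, match number).
matches = df["col_a"].str.extractall(pattern)

# Pivot the match number into columns: one column per match, NaN where absent.
wide = matches.unstack()

# Join the non-missing matches of each row with a space.
joined = wide.apply(lambda x: " ".join(x.dropna()), axis=1)

# Assignment aligns on the index, so rows without any match end up as NaN.
df["col_b_PY"] = joined
print(df)
```

Rows that contain no match are missing from the extractall result, so they show up as NaN in the new column rather than as an empty string.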
Note that you need to use a capturing group in the regex pattern so that extract can actually extract the text. From the pandas documentation:
"Extract capture groups in the regex pat as columns in a DataFrame."
Note the \b word boundary is necessary to match PY / LG as a whole word.
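A small sketch of both points, using made-up sample strings (the Series s and the result names are only for illustration): the capturing group is what extract returns, and \b stops PY from matching as the start of a longer word.

```python
import pandas as pd

s = pd.Series(["Python PY is fun", "A PYTHON script", "no marker here"])

# With a capturing group, extract returns the captured text (NaN when nothing matches).
with_boundary = s.str.extract(r"([a-zA-Z'-]+\s+PY)\b", expand=False)

# Without \b, "PY" is also accepted as the prefix of a longer word such as "PYTHON".
without_boundary = s.str.extract(r"([a-zA-Z'-]+\s+PY)", expand=False)

# A pattern with no capturing group makes extract raise ValueError.
try:
    s.str.extract(r"[a-zA-Z'-]+\s+PY\b")
    raised = False
except ValueError:
    raised = True

print(with_boundary.tolist())
print(without_boundary.tolist())
print(raised)
```

In the second string, the pattern without \b picks up "A PY" out of "A PYTHON", which is the false positive the word boundary prevents.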
Also, if you want to only start a match from a letter, you may revamp the pattern to
r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b"
r"([a-zA-Z][a-zA-Z'-]*\s+LG)\b"
   ^^^^^^^^          ^
where [a-zA-Z] will match a letter and [a-zA-Z'-]* will match 0 or more letters, apostrophes or hyphens.
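As a quick comparison of the two variants, here is a sketch on a few invented strings (s, loose and strict are illustrative names). It shows how requiring a leading letter changes what is captured when the preceding token starts with an apostrophe or consists only of hyphens.

```python
import pandas as pd

s = pd.Series(["--- PY", "'tis PY", "well-known PY"])

# Original pattern: the preceding token may start with a hyphen or apostrophe.
loose = s.str.extract(r"([a-zA-Z'-]+\s+PY)\b", expand=False)

# Revised pattern: the preceding token must start with a letter.
strict = s.str.extract(r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b", expand=False)

print(pd.DataFrame({"col_a": s, "loose": loose, "strict": strict}))
```

With the revised pattern, "--- PY" no longer matches at all, and "'tis PY" is captured as "tis PY" because the match can only begin at the first letter.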
Python 3.7 with Pandas 0.24.2:
import pandas as pd

# Widen the console display so all three columns print on one line
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 500)
df = pd.DataFrame({
    'col_a': ['Python PY is a general-purpose language LG',
              'Programming language LG in Python PY',
              'Its easier LG to understand PY',
              'The syntax of the language LG is clean PY',
              'Python PY is a general purpose PY language LG']
})
# Collect every "<word> PY" / "<word> LG" match per row and join them with a space
df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
print(df)
Output:
   col_a                                          col_b_PY              col_c_LG
0  Python PY is a general-purpose language LG     Python PY             language LG
1  Programming language LG in Python PY           Python PY             language LG
2  Its easier LG to understand PY                 understand PY         easier LG
3  The syntax of the language LG is clean PY      clean PY              language LG
4  Python PY is a general purpose PY language LG  Python PY purpose PY  language LG