I have a dataframe like as shown below
tdf = pd.DataFrame({'text_1':['value: 1.25MG - OM - PO/TUBE - ashaf', 'value:2.5 MG - OM - PO/TUBE -test','value: 18 UNITS(S)','value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had', 'value: 75 MG - OM - PO/TUBE']})
I would like to apply regex and create two columns based on rules given below
col val
should store all text after value:
and before first hyphen
col Adm
should store all text after third hyphen
I tried the below but it doesn't work accurately
tdf['text_1'].str.findall('[.0-9]+\s*[mgMG/lLcCUNIT]+')
Boolean Series in Pandas The between() function is used to get boolean Series equivalent to left <= series <= right. This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.
replace() Pandas replace() is a very rich function that is used to replace a string, regex, dictionary, list, and series from the DataFrame. The values of the DataFrame can be replaced with other values dynamically. It is capable of working with the Python regex(regular expression). It differs from updating with .
Pandas uses other names for data types than Python, for example: object for textual data. A column in a DataFrame can only have one data type. The data type in a DataFrame's single column can be checked using dtype .
Series.str.extract
tdf['text_1'].str.extract(r'^value:\s?([^-]+)(?:\s-.*?-\s)?([^-]*)(?:\s|$)')
0 1
0 1.25MG PO/TUBE
1 2.5 MG PO/TUBE
2 18 UNITS(S)
3 850 MG SC (SUBCUTANEOUS)
4 75 MG PO/TUBE
Regex details:
^
: Assert position at start of linevalue:
: Matches character sequence value:
\s?
: Matches any whitespace character between zero and one time([^-]+)
: First capturing group matches any character except -
one or more times(?:\s-.*?-\s)?
: Non capturing group match between zero and one time
\s
: Matches single whitespace character-
: Matches character -
.*?
: Matches any character between zero and unlimited times but as few times as possible-
: Matches character -
\s
: Matches single whitespace character([^-]*)
: Second capturing group matches any character except -
zero or more times(?:\s|$)
: Non capturing group
\s-
: Matches single whitespace character|
: Or switch$
: Assert position at the end of lineSee the online Regex demo
With your shown samples, could you please try following.
tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^value:\s?(\S+(?:\s[^-]+)?)(?:\s-\s.*?-([^-]*)(?:-.*)?)?$', expand=True)
tdf
Online demo for above regex
Output will be as follows.
text_1 val Adm
0 value: 1.25MG - OM - PO/TUBE - ashaf 1.25MG PO/TUBE
1 value:2.5 MG - OM - PO/TUBE -test 2.5 MG PO/TUBE
2 value: 18 UNITS(S) 18 UNITS(S) NaN
3 value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had 850 MG SC (SUBCUTANEOUS)
4 value: 75 MG - OM - PO/TUBE 75 MG PO/TUBE
Explanation: Adding detailed explanation for above.
^value:\s? ##Checking if value starts from value: space is optional here.
(\S+ ##Starting 1st capturing group from here and matching all non space here.
(?:\s[^-]+)? ##In a non-capturing group matching space till - comes keeping it optional.
) ##Closing 1st capturing group here.
(?:\s-\s.*?- ##In a non-capturing group matching space-space till - first occurrence.
([^-]*) ##Creating 2nd capturing group which has values till next - here.
(?:-.*)? ##In a non capturing group from - till end of value keeping it optional.
)?$ ##Closing non-capturing group at the end of the value here.
You can use
tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^val:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?$', expand=True)
# => >>> tdf
text_1 val \
0 val: 1.25MG - OM - PO/TUBE 1.25MG
1 val:2.5 MG - OM - PO/TUBE 2.5 MG
2 val: 18 UNITS(S) 18 UNITS(S)
3 val: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) 850 MG
4 val: 75 MG - OM - PO/TUBE 75 MG
0 PO/TUBE
1 PO/TUBE
2 NaN
3 SC (SUBCUTANEOUS)
4 PO/TUBE
See the regex demo.
Details:
^val:
- val:
at the start of string (if val:
is not always at the start of the string, remove ^
anchor)\s*
- zero or more whitespaces([^-]*?)
- Group 1: any chars other than -
as few as possible(?:\s*-[^-]*-\s*(.*))?
- an optional sequence of
\s*
- zero or more whitespaces-[^-]*-
- a -
, any zero or more chars other than -
, and then a -
\s*
- zero or more whitespaces(.*)
- Group 2: the rest of the line$
- end of string.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With