I have a dataframe like as shown below
tdf = pd.DataFrame({'text_1':['value: 1.25MG - OM - PO/TUBE - ashaf', 'value:2.5 MG - OM - PO/TUBE -test','value: 18 UNITS(S)','value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had', 'value: 75 MG - OM - PO/TUBE']})
I would like to apply regex and create two columns based on rules given below
col val should store all text after value: and before first hyphen
col Adm should store all text after third hyphen
I tried the below but it doesn't work accurately
tdf['text_1'].str.findall('[.0-9]+\s*[mgMG/lLcCUNIT]+')

Boolean Series in Pandas The between() function is used to get boolean Series equivalent to left <= series <= right. This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.
replace() Pandas replace() is a very rich function that is used to replace a string, regex, dictionary, list, and series from the DataFrame. The values of the DataFrame can be replaced with other values dynamically. It is capable of working with the Python regex(regular expression). It differs from updating with .
Pandas uses other names for data types than Python, for example: object for textual data. A column in a DataFrame can only have one data type. The data type in a DataFrame's single column can be checked using dtype .
Series.str.extracttdf['text_1'].str.extract(r'^value:\s?([^-]+)(?:\s-.*?-\s)?([^-]*)(?:\s|$)')
             0                  1
0       1.25MG            PO/TUBE
1       2.5 MG            PO/TUBE
2  18 UNITS(S)                   
3       850 MG  SC (SUBCUTANEOUS)
4        75 MG            PO/TUBE
Regex details:
^ : Assert position at start of linevalue: : Matches character sequence value:
\s?: Matches any whitespace character between zero and one time([^-]+) : First capturing group matches any character except - one or more times(?:\s-.*?-\s)? : Non capturing group match between zero and one time
\s: Matches single whitespace character- : Matches character -
.*? : Matches any character between zero and unlimited times but as few times as possible- : Matches character -
\s : Matches single whitespace character([^-]*) : Second capturing group matches any character except - zero or more times(?:\s|$) : Non capturing group
\s- : Matches single whitespace character| : Or switch$ : Assert position at the end of lineSee the online Regex demo
With your shown samples, could you please try following.
tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^value:\s?(\S+(?:\s[^-]+)?)(?:\s-\s.*?-([^-]*)(?:-.*)?)?$', expand=True)
tdf
Online demo for above regex
Output will be as follows.
                                                    text_1          val                  Adm
0                     value: 1.25MG - OM - PO/TUBE - ashaf       1.25MG             PO/TUBE 
1                        value:2.5 MG - OM - PO/TUBE -test       2.5 MG             PO/TUBE 
2                                       value: 18 UNITS(S)  18 UNITS(S)                  NaN
3  value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had       850 MG   SC (SUBCUTANEOUS) 
4                              value: 75 MG - OM - PO/TUBE        75 MG              PO/TUBE
Explanation: Adding detailed explanation for above.
^value:\s?       ##Checking if value starts from value: space is optional here.
(\S+             ##Starting 1st capturing group from here and matching all non space here.
  (?:\s[^-]+)?   ##In a non-capturing group matching space till - comes keeping it optional.
)                ##Closing 1st capturing group here.
(?:\s-\s.*?-     ##In a non-capturing group matching space-space till - first occurrence.
  ([^-]*)        ##Creating 2nd capturing group which has values till next - here.
  (?:-.*)?       ##In a non capturing group from - till end of value keeping it optional.
)?$              ##Closing non-capturing group at the end of the value here.
                        You can use
tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^val:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?$', expand=True)
# => >>> tdf
                                             text_1          val  \
0                        val: 1.25MG - OM - PO/TUBE       1.25MG   
1                         val:2.5 MG - OM - PO/TUBE       2.5 MG   
2                                  val: 18 UNITS(S)  18 UNITS(S)   
3  val: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS)       850 MG   
4                         val: 75 MG - OM - PO/TUBE        75 MG   
0            PO/TUBE  
1            PO/TUBE  
2                NaN  
3  SC (SUBCUTANEOUS)  
4            PO/TUBE  
See the regex demo.
Details:
^val: - val: at the start of string (if val: is not always at the start of the string, remove ^ anchor)\s* - zero or more whitespaces([^-]*?) - Group 1: any chars other than - as few as possible(?:\s*-[^-]*-\s*(.*))? - an optional sequence of
\s* - zero or more whitespaces-[^-]*- - a -, any zero or more chars other than -, and then a -
\s* - zero or more whitespaces(.*) - Group 2: the rest of the line$ - end of string.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With