
Applying regex to pandas column based on different pos of same character

I have a dataframe like the one shown below:

import pandas as pd

tdf = pd.DataFrame({'text_1': ['value: 1.25MG - OM - PO/TUBE - ashaf',
                               'value:2.5 MG - OM - PO/TUBE -test',
                               'value: 18 UNITS(S)',
                               'value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had',
                               'value: 75 MG - OM - PO/TUBE']})

I would like to apply a regex and create two columns based on the rules given below:

Column val should store all text after value: and before the first hyphen.

Column Adm should store all text after the third hyphen.

I tried the following, but it doesn't extract the values accurately:

tdf['text_1'].str.findall(r'[.0-9]+\s*[mgMG/lLcCUNIT]+')
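
To see where this attempt falls short, here is a minimal sketch that rebuilds the sample frame and prints what findall returns (the name `matches` is only for illustration). The character class has no `S` and no parentheses, so the units row is cut short, and nothing in the pattern targets the text after the hyphens.

```python
import pandas as pd

tdf = pd.DataFrame({'text_1': ['value: 1.25MG - OM - PO/TUBE - ashaf',
                               'value:2.5 MG - OM - PO/TUBE -test',
                               'value: 18 UNITS(S)',
                               'value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had',
                               'value: 75 MG - OM - PO/TUBE']})

# findall returns, per row, a list of every non-overlapping match
matches = tdf['text_1'].str.findall(r'[.0-9]+\s*[mgMG/lLcCUNIT]+')

# The third row comes back as '18 UNIT' because 'S' and '(' are not in the class
print(matches.tolist())
```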

The Great asked Apr 12 '21 10:04



3 Answers

Series.str.extract

tdf['text_1'].str.extract(r'^value:\s?([^-]+)(?:\s-.*?-\s)?([^-]*)(?:\s|$)')

             0                  1
0       1.25MG            PO/TUBE
1       2.5 MG            PO/TUBE
2  18 UNITS(S)                   
3       850 MG  SC (SUBCUTANEOUS)
4        75 MG            PO/TUBE

Regex details:

  • ^ : Assert position at start of line
  • value: : Matches character sequence value:
  • \s?: Matches any whitespace character between zero and one time
  • ([^-]+) : First capturing group matches any character except - one or more times
  • (?:\s-.*?-\s)? : Non capturing group match between zero and one time
    • \s: Matches single whitespace character
    • - : Matches character -
    • .*? : Matches any character between zero and unlimited times but as few times as possible
    • - : Matches character -
    • \s : Matches single whitespace character
  • ([^-]*) : Second capturing group matches any character except - zero or more times
  • (?:\s|$) : Non capturing group
    • \s : Matches single whitespace character
    • | : Or switch
    • $ : Assert position at the end of line

See the online Regex demo
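
As a minimal sketch of how the pattern above could be wired into the question's frame, the snippet below assigns the two capture groups to the val and Adm columns. The column names come from the question; the variable name `pattern` is only for illustration.

```python
import pandas as pd

tdf = pd.DataFrame({'text_1': ['value: 1.25MG - OM - PO/TUBE - ashaf',
                               'value:2.5 MG - OM - PO/TUBE -test',
                               'value: 18 UNITS(S)',
                               'value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had',
                               'value: 75 MG - OM - PO/TUBE']})

pattern = r'^value:\s?([^-]+)(?:\s-.*?-\s)?([^-]*)(?:\s|$)'

# extract returns one column per capture group; assign them to named columns
tdf[['val', 'Adm']] = tdf['text_1'].str.extract(pattern)
print(tdf[['val', 'Adm']])
```

Note that for the row without hyphens the second group still matches, so Adm holds an empty string there rather than NaN.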

Shubham Sharma answered Oct 20 '22 21:10


With the samples you have shown, please try the following.

tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^value:\s?(\S+(?:\s[^-]+)?)(?:\s-\s.*?-([^-]*)(?:-.*)?)?$', expand=True)
tdf

Online demo for above regex

Output will be as follows.

                                                    text_1          val                  Adm
0                     value: 1.25MG - OM - PO/TUBE - ashaf       1.25MG             PO/TUBE 
1                        value:2.5 MG - OM - PO/TUBE -test       2.5 MG             PO/TUBE 
2                                       value: 18 UNITS(S)  18 UNITS(S)                  NaN
3  value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had       850 MG   SC (SUBCUTANEOUS) 
4                              value: 75 MG - OM - PO/TUBE        75 MG              PO/TUBE

Explanation of the regex above:

^value:\s?       ##Match value: at the start of the string; the whitespace after it is optional.
(\S+             ##Start the 1st capturing group and match all non-space characters.
  (?:\s[^-]+)?   ##Optional non-capturing group: a whitespace, then everything up to the next -.
)                ##Close the 1st capturing group.
(?:\s-\s.*?-     ##Non-capturing group: space, -, space, then as little as possible up to the next -.
  ([^-]*)        ##2nd capturing group: everything up to the following -.
  (?:-.*)?       ##Optional non-capturing group: from that - to the end of the value.
)?$              ##Close the optional non-capturing group; assert the end of the value.
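
The commented breakdown above can also be kept inside the pattern itself. The following is a sketch, assuming the `flags` argument of str.extract is used with re.VERBOSE (a choice made here for illustration, not part of the original answer). Because the second group keeps the spaces around the hyphens, the sketch also trims Adm with str.strip.

```python
import re
import pandas as pd

tdf = pd.DataFrame({'text_1': ['value: 1.25MG - OM - PO/TUBE - ashaf',
                               'value:2.5 MG - OM - PO/TUBE -test',
                               'value: 18 UNITS(S)',
                               'value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had',
                               'value: 75 MG - OM - PO/TUBE']})

# Same regex as above; re.VERBOSE ignores the layout whitespace and # comments
pattern = r'''
    ^value:\s?          # literal prefix, optional whitespace
    (\S+(?:\s[^-]+)?)   # group 1: the value before the first hyphen
    (?:\s-\s.*?-        # optional tail: skip the field between hyphens 1 and 2
      ([^-]*)           # group 2: text up to the next hyphen
      (?:-.*)?          # anything from the third hyphen onward
    )?$
'''

tdf[['val', 'Adm']] = tdf['text_1'].str.extract(pattern, flags=re.VERBOSE, expand=True)

# Group 2 captures the padding around the hyphens; strip it (NaN rows stay NaN)
tdf['Adm'] = tdf['Adm'].str.strip()
print(tdf[['val', 'Adm']])
```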

RavinderSingh13 answered Oct 20 '22 20:10


You can use

tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^val:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?$', expand=True)
>>> tdf
                                             text_1          val                Adm
0                        val: 1.25MG - OM - PO/TUBE       1.25MG            PO/TUBE
1                         val:2.5 MG - OM - PO/TUBE       2.5 MG            PO/TUBE
2                                  val: 18 UNITS(S)  18 UNITS(S)                NaN
3  val: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS)       850 MG  SC (SUBCUTANEOUS)
4                         val: 75 MG - OM - PO/TUBE        75 MG            PO/TUBE

See the regex demo.

Details:

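  • ^val: - val: at the start of string (if val: is not always at the start of the string, remove the ^ anchor). The sample above uses a shortened val: prefix; for the question's data, which starts with value:, use ^value: instead, since ^val: does not match value: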
  • \s* - zero or more whitespaces
  • ([^-]*?) - Group 1: any chars other than - as few as possible
  • (?:\s*-[^-]*-\s*(.*))? - an optional sequence of
    • \s* - zero or more whitespaces
    • -[^-]*- - a -, any zero or more chars other than -, and then a -
    • \s* - zero or more whitespaces
    • (.*) - Group 2: the rest of the line
  • $ - end of string.
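
The sample frame in this answer uses a val: prefix and no trailing fields, while the question's data starts with value: and sometimes has a further hyphen. Here is a sketch of the same pattern with the prefix swapped to value: and applied to the question's frame. Since Group 2 takes the rest of the line, it keeps any trailing field; the split step that keeps only the first field, and the name `adm_first`, are assumptions about what is wanted, not part of the original answer.

```python
import pandas as pd

tdf = pd.DataFrame({'text_1': ['value: 1.25MG - OM - PO/TUBE - ashaf',
                               'value:2.5 MG - OM - PO/TUBE -test',
                               'value: 18 UNITS(S)',
                               'value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had',
                               'value: 75 MG - OM - PO/TUBE']})

# Same structure as the answer's regex, with the question's "value:" prefix
pattern = r'^value:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?$'
tdf[['val', 'Adm']] = tdf['text_1'].str.extract(pattern, expand=True)

# Group 2 is "the rest of the line", so trailing fields such as "- ashaf" remain.
# Assumption: only the first field is wanted, so cut at the next hyphen and trim.
adm_first = tdf['Adm'].str.split('-').str[0].str.strip()
print(tdf[['val', 'Adm']].assign(Adm_first=adm_first))
```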

Wiktor Stribiżew answered Oct 20 '22 19:10