Applying regex to pandas column based on different pos of same character

I have a dataframe like the one shown below:

import pandas as pd

tdf = pd.DataFrame({'text_1': ['value: 1.25MG - OM - PO/TUBE - ashaf', 'value:2.5 MG - OM - PO/TUBE -test', 'value: 18 UNITS(S)', 'value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had', 'value: 75 MG - OM - PO/TUBE']})

I would like to apply a regex and create two columns based on the rules given below:

Column val should store all text after value: and before the first hyphen.

Column Adm should store all text after the third hyphen.

I tried the following, but it does not extract the values accurately:

tdf['text_1'].str.findall(r'[.0-9]+\s*[mgMG/lLcCUNIT]+')

asked Apr 12 '21 10:04

The Great

3 Answers

`Series.str.extract`

tdf['text_1'].str.extract(r'^value:\s?([^-]+)(?:\s-.*?-\s)?([^-]*)(?:\s|$)')

             0                  1
0       1.25MG            PO/TUBE
1       2.5 MG            PO/TUBE
2  18 UNITS(S)                   
3       850 MG  SC (SUBCUTANEOUS)
4        75 MG            PO/TUBE

Regex details:

^ : Assert position at start of line
value: : Matches character sequence value:
\s?: Matches any whitespace character between zero and one time
([^-]+) : First capturing group matches any character except - one or more times
(?:\s-.*?-\s)? : Non capturing group match between zero and one time
- \s: Matches single whitespace character
- - : Matches character -
- .*? : Matches any character between zero and unlimited times but as few times as possible
- - : Matches character -
- \s : Matches single whitespace character
([^-]*) : Second capturing group matches any character except - zero or more times
(?:\s|$) : Non capturing group
- \s : Matches single whitespace character
- | : Or switch
- $ : Assert position at the end of line

See the online Regex demo
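
To try the pattern end to end, the snippet below rebuilds the sample frame from the question and assigns the two capture groups to the `val` and `Adm` columns the question asks for. It is a minimal sketch, assuming pandas is imported as `pd`; the `pattern` variable name is only for illustration. Note that with this pattern the row without hyphens ends up with an empty string in `Adm` rather than NaN, because the second group is allowed to match zero characters.

```python
import pandas as pd

# Sample data copied from the question.
tdf = pd.DataFrame({'text_1': [
    'value: 1.25MG - OM - PO/TUBE - ashaf',
    'value:2.5 MG - OM - PO/TUBE -test',
    'value: 18 UNITS(S)',
    'value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had',
    'value: 75 MG - OM - PO/TUBE',
]})

# Pattern from the answer above; "pattern" is an illustrative name.
pattern = r'^value:\s?([^-]+)(?:\s-.*?-\s)?([^-]*)(?:\s|$)'

# str.extract returns one column per capture group (0 and 1 here),
# which are assigned to the two new columns in order.
tdf[['val', 'Adm']] = tdf['text_1'].str.extract(pattern)

print(tdf[['val', 'Adm']])
```

If NaN is preferred for rows with no administration route, the empty strings could be replaced afterwards, for example with `tdf['Adm'].replace('', float('nan'))`.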

answered Oct 20 '22 21:10

Shubham Sharma

With the samples shown, please try the following.

tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^value:\s?(\S+(?:\s[^-]+)?)(?:\s-\s.*?-([^-]*)(?:-.*)?)?$', expand=True)
tdf

Online demo for above regex

Output will be as follows.

                                                    text_1          val                  Adm
0                     value: 1.25MG - OM - PO/TUBE - ashaf       1.25MG             PO/TUBE 
1                        value:2.5 MG - OM - PO/TUBE -test       2.5 MG             PO/TUBE 
2                                       value: 18 UNITS(S)  18 UNITS(S)                  NaN
3  value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had       850 MG   SC (SUBCUTANEOUS) 
4                              value: 75 MG - OM - PO/TUBE        75 MG              PO/TUBE

Explanation: a detailed breakdown of the regex above.

^value:\s?       ##Match "value:" at the start of the string, followed by an optional whitespace character.
(\S+             ##Start the 1st capturing group and match one or more non-whitespace characters.
  (?:\s[^-]+)?   ##Optional non-capturing group: a whitespace character, then everything up to the next -.
)                ##Close the 1st capturing group.
(?:\s-\s.*?-     ##Start a non-capturing group: space, -, space, then as few characters as possible up to the next -.
  ([^-]*)        ##2nd capturing group: everything up to the following - (or the end of the string).
  (?:-.*)?       ##Optional non-capturing group: a - and the rest of the string.
)?$              ##Close the non-capturing group, make it optional, and anchor at the end of the string.
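
One detail worth checking in the result: the second group `([^-]*)` starts immediately after a hyphen, so the captured `Adm` text can carry the spaces that surround it (the trailing spaces visible in the output above hint at this). The following is a minimal sketch, assuming that whitespace is unwanted, that trims both columns with `Series.str.strip`. The names `pattern` and `extracted` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
tdf = pd.DataFrame({'text_1': [
    'value: 1.25MG - OM - PO/TUBE - ashaf',
    'value:2.5 MG - OM - PO/TUBE -test',
    'value: 18 UNITS(S)',
    'value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had',
    'value: 75 MG - OM - PO/TUBE',
]})

# Pattern from the answer above.
pattern = r'^value:\s?(\S+(?:\s[^-]+)?)(?:\s-\s.*?-([^-]*)(?:-.*)?)?$'
extracted = tdf['text_1'].str.extract(pattern, expand=True)

# Trim surrounding whitespace from each capture.
# Rows where the optional part did not match stay NaN.
tdf['val'] = extracted[0].str.strip()
tdf['Adm'] = extracted[1].str.strip()

print(tdf[['val', 'Adm']])
```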

answered Oct 20 '22 20:10

RavinderSingh13

You can use

tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^val:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?$', expand=True)
# => >>> tdf
                                             text_1          val  \
0                        val: 1.25MG - OM - PO/TUBE       1.25MG   
1                         val:2.5 MG - OM - PO/TUBE       2.5 MG   
2                                  val: 18 UNITS(S)  18 UNITS(S)   
3  val: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS)       850 MG   
4                         val: 75 MG - OM - PO/TUBE        75 MG   


0            PO/TUBE  
1            PO/TUBE  
2                NaN  
3  SC (SUBCUTANEOUS)  
4            PO/TUBE

See the regex demo.

Details:

^val: - val: at the start of string (if val: is not always at the start of the string, remove ^ anchor)
\s* - zero or more whitespaces
([^-]*?) - Group 1: any chars other than - as few as possible
(?:\s*-[^-]*-\s*(.*))? - an optional sequence of
- \s* - zero or more whitespaces
- -[^-]*- - a -, any zero or more chars other than -, and then a -
- \s* - zero or more whitespaces
- (.*) - Group 2: the rest of the line
$ - end of string.
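
The sample rows in this answer start with `val:` and have no trailing segment, whereas the frame in the question uses `value:` and sometimes ends with an extra `- ...` part. Below is a hedged sketch of how the same idea might be adapted to that data: the prefix is changed to `value:`, and Group 2 is narrowed from `(.*)` to a lazy `([^-]*?)` followed by an optional `(?:-.*)?`, so that it stops before the next hyphen. This variant is an illustration under those assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (note the "value:" prefix).
tdf = pd.DataFrame({'text_1': [
    'value: 1.25MG - OM - PO/TUBE - ashaf',
    'value:2.5 MG - OM - PO/TUBE -test',
    'value: 18 UNITS(S)',
    'value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had',
    'value: 75 MG - OM - PO/TUBE',
]})

# Adapted pattern (an assumption, not the answer's exact regex):
#   Group 1: text before the first hyphen, without trailing spaces
#   Group 2: text after the second hyphen, up to the next hyphen or the end
pattern = r'^value:\s*([^-]*?)(?:\s*-[^-]*-\s*([^-]*?)\s*(?:-.*)?)?$'

tdf[['val', 'Adm']] = tdf['text_1'].str.extract(pattern, expand=True)

print(tdf[['val', 'Adm']])
```

With the unmodified `(.*)` group and only the prefix changed, the first row would instead yield `PO/TUBE - ashaf`, since `.*` runs to the end of the string.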

answered Oct 20 '22 19:10

Wiktor Stribiżew

Tags:

python

string

regex

pandas

dataframe