Regular expression to find a sequence of numbers before multiple patterns, into a new column (Python, Pandas)

Tags:

python

regex

pandas

spyder

Here is my sample data:

import pandas as pd
import re
  
cars = pd.DataFrame({'Engine Information': {0: 'Honda 2.4L 4 cylinder 190 hp 162 ft-lbs',
          1: 'Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs',
          2: 'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs',
          3: 'MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs',
          4: 'Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV',
          5: 'GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs'},
         'HP': {0: None, 1: None, 2: None, 3: None, 4: None, 5: None}})

Here is my desired output:

[image: desired output, not reproduced here]

I have created a new column called 'HP', into which I want to extract the horsepower figure from the original column ('Engine Information').

Here is the code I have tried to do this:

cars['HP'] = cars['Engine Information'].apply(lambda x: re.match(r'\\d+(?=\\shp|hp)', str(x)))

The idea is to match the pattern 'a sequence of digits that comes before either "hp" or " hp"'. This is because some of the cells have no space between the number and 'hp', as shown in my example.

I'm sure the regex is correct, because I have successfully done a similar process in R. However, I have tried functions such as str.extract, re.findall, re.search and re.match, and they either raise errors or return 'None' values (as shown in the sample). So I am a bit lost here.

Thanks!

asked Oct 02 '20 06:10

k3b

2 Answers

You can use str.extract:

cars['HP'] = cars['Engine Information'].str.extract(r'(\d+)\s*hp\b', flags=re.I)

Details

(\d+)\s*hp\b - captures one or more digits into Group 1, then matches zero or more whitespace characters (\s*) followed by hp as a whole word (\b marks a word boundary). The match is case-insensitive because of flags=re.I.
str.extract returns only what the capturing group matched (the pattern must contain at least one capturing group), so the whitespace and hp are not part of the result.

Python demo results:

>>> cars
                               Engine Information   HP
0         Honda 2.4L 4 cylinder 190 hp 162 ft-lbs  190
1  Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs  420
2          Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs  390
3          MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs  118
4       Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV  360
5           GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs  352
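One thing to keep in mind: the values that str.extract returns are strings, not numbers. If you need an integer column, something along these lines might help. This is only a sketch: the third row is a made-up example with no horsepower figure, and the choice of expand=False plus the nullable 'Int64' dtype is my assumption about what is convenient, not part of the answer above.

```python
import re
import pandas as pd

cars = pd.DataFrame({'Engine Information': [
    'Honda 2.4L 4 cylinder 190 hp 162 ft-lbs',
    'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs',
    'Unknown engine, no power figure',  # hypothetical row without a match
]})

# expand=False returns a Series for a single capture group instead of a DataFrame.
hp_text = cars['Engine Information'].str.extract(
    r'(\d+)\s*hp\b', flags=re.I, expand=False
)

# The extracted values are strings (missing where nothing matched),
# so convert them to a nullable integer dtype.
cars['HP'] = pd.to_numeric(hp_text).astype('Int64')
print(cars)
```

Rows without a match end up as missing values rather than raising an error.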

answered Oct 10 '22 02:10

Wiktor Stribiżew

There are several problems:

re.match only looks at the beginning of the string; use re.search if the pattern may appear anywhere.
Don't double the backslashes in a raw string: write either '\\d hp' or r'\d hp'. Raw strings exist precisely so that you don't have to escape backslashes.
Return the matched group. Your code searches but never extracts the text that was found. re.search(rex, string) returns a match object (or None if nothing matched), from which you can extract the groups, e.g. re.search(rex, string)[0] for the whole match.
Wrap the group access in a separate function, because you have to check that there was a match before accessing the group. If you don't, an exception will stop apply partway through.
apply is slow; prefer pandas' vectorized string methods such as str.extract: cars['Engine Information'].str.extract(r'(\d+) ?hp')
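To see the first two problems in isolation, here is a small sketch using one string from the question. The variable names are mine, and the lookahead is written as \s?hp, which should be equivalent to the question's \shp|hp alternative.

```python
import re

s = 'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs'

# re.match only tries to match at position 0, where the string starts with 'D'.
match_result = re.match(r'\d+(?=\s?hp)', s)
print(match_result)            # None

# re.search scans the whole string for the first match.
search_result = re.search(r'\d+(?=\s?hp)', s)
print(search_result.group(0))  # 390

# In a raw string, '\\d' means a literal backslash followed by 'd',
# so the pattern from the question cannot match any digits.
double_escaped = re.search(r'\\d+(?=\\shp|hp)', s)
print(double_escaped)          # None
```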

Your approach should work with this:

def match_horsepower(s):
    m = re.search(r'(\d+) ?hp', s)
    return int(m[1]) if m else None

cars['HP'] = cars['Engine Information'].apply(match_horsepower)
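The question wraps each cell in str(x), which suggests the column may contain non-string cells (None or NaN). In that case re.search would raise a TypeError inside the function above. A possible variant, purely as a sketch: match_horsepower_safe and HP_PATTERN are hypothetical names, and the case-insensitive flag is an extra assumption on my part.

```python
import re

# Compiled once; IGNORECASE also accepts 'HP' or 'Hp'.
HP_PATTERN = re.compile(r'(\d+)\s?hp', flags=re.IGNORECASE)

def match_horsepower_safe(value):
    """Hypothetical variant of match_horsepower that tolerates non-string cells."""
    if not isinstance(value, str):
        return None  # None, NaN and other non-strings have no horsepower figure
    m = HP_PATTERN.search(value)
    return int(m.group(1)) if m else None

samples = [
    'Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV',
    'GMC 6.0L 8 Cylinder 352 HP',  # made-up upper-case variant
    None,
    float('nan'),
]
results = [match_horsepower_safe(v) for v in samples]
print(results)  # [360, 352, None, None]
```

It can be passed to apply in the same way as match_horsepower.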

answered Oct 10 '22 02:10

CodeNStuff
