Here is my sample data:
import pandas as pd
import re
cars = pd.DataFrame({'Engine Information': {0: 'Honda 2.4L 4 cylinder 190 hp 162 ft-lbs',
1: 'Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs',
2: 'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs',
3: 'MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs',
4: 'Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV',
5: 'GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs'},
'HP': {0: None, 1: None, 2: None, 3: None, 4: None, 5: None}})
Here is my desired output:
I have created a new column called 'HP' where I want to extract the horsepower figure from the original column ('Engine Information')
Here is the code I have tried to do this:
cars['HP'] = cars['Engine Information'].apply(lambda x: re.match(r'\\d+(?=\\shp|hp)', str(x)))
The idea is I want to regex match the pattern: 'a sequence of numbers that come before either 'hp' or ' hp'. This is because some of the cells have no 'space' in between the number and 'hp' as showed in my example.
I'm sure the regex is correct, because I have successfully done a similar process in R. However, I have tried functions such as str.extract
, re.findall
, re.search
, re.match
. Either returning errors or 'None' values (as shown in the sample). So here I am a bit lost.
Thanks!
Regex search groups or multiple patterns On a successful search, we can use match. group(1) to get the match value of a first group and match. group(2) to get the match value of a second group. Now let's see how to use these two patterns to search any six-letter word and two consecutive digits inside the target string.
You can use str.extract
:
cars['HP'] = cars['Engine Information'].str.extract(r'(\d+)\s*hp\b', flags=re.I)
Details
(\d+)\s*hp\b
- matches and captures into Group 1 one or more digits, then just matches 0 or more whitespaces (\s*
) and hp
(in a case insensitive way due to flags=re.I
) as a whole word (since \b
marks a word boundary)str.extract
only returns the captured value if there is a capturing group in the pattern, so the hp
and whitespaces are not part of the result.Python demo results:
>>> cars
Engine Information HP
0 Honda 2.4L 4 cylinder 190 hp 162 ft-lbs 190
1 Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs 420
2 Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs 390
3 MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs 118
4 Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV 360
5 GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs 352
There are several problems:
re.match
just looks at the beginning of your string, use re.search
if your pattern may appear anywhere'\\d hp'
or r'\d hp'
- raw strings help your exactly to avoid escapingre.search(rex, string)
gives you a complex object (a match object) from this you can extract all groups, e.g. re.search(rex, string)[0]
cars['Engine Information'].str.extract(r'(\d+) ?hp')
Your approach should work with this:
def match_horsepower(s):
m = re.search(r'(\d+) ?hp', s)
return int(m[1]) if m else None
cars['HP'] = cars['Engine Information'].apply(match_horsepower)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With