Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression to find a sequence of numbers before multiple patterns, into a new column (Python, Pandas)

Here is my sample data:

import pandas as pd
import re
  
cars = pd.DataFrame({'Engine Information': {0: 'Honda 2.4L 4 cylinder 190 hp 162 ft-lbs',
          1: 'Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs',
          2: 'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs',
          3: 'MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs',
          4: 'Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV',
          5: 'GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs'},
         'HP': {0: None, 1: None, 2: None, 3: None, 4: None, 5: None}})

Here is my desired output:

enter image description here

I have created a new column called 'HP' where I want to extract the horsepower figure from the original column ('Engine Information')

Here is the code I have tried to do this:

cars['HP'] = cars['Engine Information'].apply(lambda x: re.match(r'\\d+(?=\\shp|hp)', str(x)))

The idea is I want to regex match the pattern: 'a sequence of numbers that come before either 'hp' or ' hp'. This is because some of the cells have no 'space' in between the number and 'hp' as showed in my example.

I'm sure the regex is correct, because I have successfully done a similar process in R. However, I have tried functions such as str.extract, re.findall, re.search, re.match. Either returning errors or 'None' values (as shown in the sample). So here I am a bit lost.

Thanks!

like image 204
k3b Avatar asked Oct 02 '20 06:10

k3b


People also ask

How do I search for multiple patterns in Python?

Regex search groups or multiple patterns On a successful search, we can use match. group(1) to get the match value of a first group and match. group(2) to get the match value of a second group. Now let's see how to use these two patterns to search any six-letter word and two consecutive digits inside the target string.


2 Answers

You can use str.extract:

cars['HP'] = cars['Engine Information'].str.extract(r'(\d+)\s*hp\b', flags=re.I)

Details

  • (\d+)\s*hp\b - matches and captures into Group 1 one or more digits, then just matches 0 or more whitespaces (\s*) and hp (in a case insensitive way due to flags=re.I) as a whole word (since \b marks a word boundary)
  • str.extract only returns the captured value if there is a capturing group in the pattern, so the hp and whitespaces are not part of the result.

Python demo results:

>>> cars
                               Engine Information   HP
0         Honda 2.4L 4 cylinder 190 hp 162 ft-lbs  190
1  Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs  420
2          Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs  390
3          MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs  118
4       Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV  360
5           GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs  352
like image 162
Wiktor Stribiżew Avatar answered Oct 10 '22 02:10

Wiktor Stribiżew


There are several problems:

  • re.match just looks at the beginning of your string, use re.search if your pattern may appear anywhere
  • don't escape if you use a raw string, i.e. either'\\d hp' or r'\d hp' - raw strings help your exactly to avoid escaping
  • Return the matched group. You just search but do not yield the group found. re.search(rex, string) gives you a complex object (a match object) from this you can extract all groups, e.g. re.search(rex, string)[0]
  • you have to wrap the access in a separate function because you have to check if there was any match before accessing the group. If you don't do that, an exception may stop the apply process right in the middle
  • apply is slow; use pandas vectorized functions like extract: cars['Engine Information'].str.extract(r'(\d+) ?hp')

Your approach should work with this:

def match_horsepower(s):
    m = re.search(r'(\d+) ?hp', s)
    return int(m[1]) if m else None

cars['HP'] = cars['Engine Information'].apply(match_horsepower)
like image 32
CodeNStuff Avatar answered Oct 10 '22 02:10

CodeNStuff