Python

Question

I have a column with object dtype, where numbers, text and symbols are all mixed up.

For example:

0 200 lbs today (189 last year)

1 99 lbs

2 250 lbs with clothes on (247 without)

3 current weight is 330

I need to extract only numbers, but I've been trying for hours without success.

I've tried with to_numeric like this:

raw['weight'] = pd.to_numeric(raw['weight'], errors='coerce', downcast='integer')

Because the column has object dtype, many parsing errors arise, and when I use errors='coerce', the entire column becomes NaN.

Any ideas?

The expected output would show the first number from each row. For my example, the result would be: 200, 99, 250, 330

Denver · Accepted Answer

You could try something like this:

import re

raw['weight'] = raw['weight'].apply(lambda x: re.search(r'[-+]?[0-9]+', x).group(0))

This grabs the first number found in each string (as a string, not an integer). You would have to modify the pattern to get only the number inside the parentheses, outside the parentheses, etc.
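To illustrate the kind of modification mentioned here, this is a rough sketch using only the re module on the sample rows from the question. The names first_number, number_in_parens and PAREN_NUMBER are made up for this example, and the parenthesis pattern assumes the number comes directly after the opening parenthesis.

```python
import re

# Sample rows copied from the question
samples = [
    "200 lbs today (189 last year)",
    "99 lbs",
    "250 lbs with clothes on (247 without)",
    "current weight is 330",
]

# Same pattern as in the answer: an optional sign followed by digits
FIRST_NUMBER = re.compile(r"[-+]?[0-9]+")
# Hypothetical variant: a number that directly follows an opening parenthesis
PAREN_NUMBER = re.compile(r"\(([-+]?[0-9]+)")

def first_number(text):
    """Return the first number in text as a string, or None if there is none."""
    match = FIRST_NUMBER.search(text)
    return match.group(0) if match else None

def number_in_parens(text):
    """Return the number right after '(' as a string, or None if there is none."""
    match = PAREN_NUMBER.search(text)
    return match.group(1) if match else None

first = [first_number(s) for s in samples]
in_parens = [number_in_parens(s) for s in samples]
```

Rows without a parenthesized number give None in the second list, so you would still need to decide how to treat those.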

[EDIT]

If NaN values are present in the weight column, the apply call with re.search above will fail, because re.search expects a string and NaN is a float. If you don't want to drop the NaN values, you could handle them with something like this:

import re

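def get_num(val):
    # Pass non-string values (such as NaN) through unchanged
    if not isinstance(val, str):
        return val
    # Return the first signed integer found, or None if the string has no digits
    match = re.search(r'[-+]?[0-9]+', val)
    return match.group(0) if match else None

raw['weight'] = raw['weight'].apply(get_num)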
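Note that the approach above returns the matched text as a string. If you want actual numbers, one possible variation is sketched below on a plain list, so it runs without pandas. The helper name first_int is hypothetical, and returning NaN for strings without digits is an assumption about what you would want in that case.

```python
import math
import re

def first_int(val):
    """Hypothetical helper: first integer in a string, NaN when nothing is found."""
    # Leave non-string values (for example NaN) as they are
    if not isinstance(val, str):
        return val
    match = re.search(r"[-+]?[0-9]+", val)
    # Convert the matched text to int; fall back to NaN if there are no digits
    return int(match.group(0)) if match else math.nan

values = [
    "200 lbs today (189 last year)",
    "99 lbs",
    float("nan"),
    "no number here",
    "current weight is 330",
]
converted = [first_int(v) for v in values]
```

The same function could be passed to raw['weight'].apply(first_int). If the column contains NaN, the result would typically come out as a float column rather than an integer one.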

Python - extract only first numbers

Micaela De León
