I have a column with object dtype, where numbers, text and symbols are all mixed up.
For example:
0 200 lbs today (189 last year)
1 99 lbs
2 250 lbs with clothes on (247 without)
3 current weight is 330
I need to extract only numbers, but I've been trying for hours without success.
I've tried with to_numeric like this:
raw['weight'] = pd.to_numeric(raw['weight'], errors='coerce', downcast='integer')
Given it's an object dtype, many parsing errors arise, but when I use coerce, the entire column becomes NaN.
Any ideas?
The expected output would be the first number in each row. The result from my example would be: 200, 99, 250, 330
You could try something like this:
import re
raw['weight'] = raw['weight'].apply(lambda x: re.search(r'[-+]?[0-9]+', x).group(0))
This grabs the first number found in each string (as a string, not yet converted to a numeric type). You would have to modify the pattern to get only the number in parentheses, outside of parentheses, etc.
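As a possible alternative to apply with re.search, here is a minimal sketch using pandas' own string methods. It assumes the frame is called raw and the column weight, as in the question; the weight_num column name is made up for illustration. Series.str.extract pulls the first match of the capture group per row, and pd.to_numeric then converts the extracted strings to numbers.

```python
import pandas as pd

# Sample data copied from the question; `raw` and `weight` follow the
# question's own code, while `weight_num` is a hypothetical column name.
raw = pd.DataFrame({
    "weight": [
        "200 lbs today (189 last year)",
        "99 lbs",
        "250 lbs with clothes on (247 without)",
        "current weight is 330",
    ]
})

# str.extract returns the first match of the capture group for each row,
# or NaN when a row has no match.
first_number = raw["weight"].str.extract(r"([-+]?[0-9]+)", expand=False)

# The extracted values are still strings, so convert them to numbers.
raw["weight_num"] = pd.to_numeric(first_number, errors="coerce")

print(raw["weight_num"].tolist())  # [200, 99, 250, 330]
```

Writing to a separate column keeps the original text around, which may help when checking rows that did not parse as expected.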
[EDIT]
If NaN values are present in the weight column, the above example will fail, because re.search expects a string and NaN is a float. If you don't want to drop the NaN values, you could handle them with something like this:
import re

def get_num(val):
    # Leave non-string values (such as NaN) untouched.
    if not isinstance(val, str):
        return val
    # Return the first integer-like token found in the string.
    return re.search(r'[-+]?[0-9]+', val).group(0)

raw['weight'] = raw['weight'].apply(get_num)
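One case the helper does not cover is a string that contains no digits at all: re.search then returns None, and calling .group(0) on it raises an AttributeError. The sketch below is one way to guard against that, and it also converts the match to int. The name first_int and the sample values are hypothetical, chosen only for illustration.

```python
import re

import pandas as pd

# Same pattern as in the answer above, compiled once for reuse.
NUMBER = re.compile(r"[-+]?[0-9]+")

def first_int(val):
    """Return the first integer in val, or NaN if none is found."""
    # Non-string values (NaN, existing numbers) pass through unchanged.
    if not isinstance(val, str):
        return val
    match = NUMBER.search(val)
    # re.search returns None when the string contains no digits.
    return int(match.group(0)) if match else float("nan")

# Hypothetical sample mixing text, a missing value, and a plain number.
sample = pd.Series(["99 lbs", float("nan"), "no digits here", 250])
result = sample.apply(first_int)
print(result)
```

Rows without a number come back as NaN instead of raising, so they can be inspected or dropped afterwards.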