I have a pandas dataframe that contains height information, and I can't figure out how to convert the somewhat unstructured values into numbers.
I figured the best approach was regex. When I try to simplify the problem, I usually take the first item in the dataframe (7' 5.5") and test a regex on it alone, but I couldn't work out how to put that value in a string because of the quotes. So I'm really confused about how to approach this.
Here is my dataframe:
  HeightNoShoes HeightShoes
0       7' 5.5"         NaN
1        6' 11"    7' 0.25"
2      6' 7.75"       6' 9"
3       6' 5.5"    6' 6.75"
4        5' 11"       6' 0"
Output should be in inches:
  HeightNoShoes HeightShoes
0          89.5         NaN
1            83       84.25
2         79.75          81
3          77.5       78.75
4            71          72
My next option would be writing this to CSV and using Excel, but I would prefer to learn how to do it in Python/pandas. Any help would be greatly appreciated.
The previous answer is a good solution that avoids regular expressions. I'm posting this in case you are curious about how your first idea (using regexes) would work.
To put a value such as 7' 5.5" into a Python string, you can escape the quote that matches the string delimiter.
For example:
py_str = "7' 5.5\""
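Escaping is only one way to write such a literal. As a small sketch (the variable names here are just for illustration), the same seven-character string can be written with either quote style or with triple quotes. Note that this only matters for literals typed into source code; values read from a CSV file or already stored in a dataframe need no escaping at all.

```python
# Three equivalent ways to write the literal 7' 5.5" in Python source.
escaped_double = "7' 5.5\""    # double-quoted, escape the inner "
escaped_single = '7\' 5.5"'    # single-quoted, escape the inner '
triple_quoted = '''7' 5.5"'''  # triple single quotes, nothing to escape

print(escaped_double)
print(escaped_double == escaped_single == triple_quoted)
```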
This, combined with a regular expression, lets you extract what you need from the input to calculate the output. The input consists of an integer (feet) followed by ', a space, and then a number (inches) followed by ". The inches part is matched as optional leading digits, an optional ., and one or more digits, so 11, 5.5 and .5 all match. Here is a regular expression that extracts the feet and inches from the input: ([0-9]+)' ([0-9]*\.?[0-9]+)"
The first group of the regex retrieves the feet and the second retrieves the inches. Here is an example of a function in Python that returns a float, in inches, based on input data such as "7' 5.5\"", or NaN if there is no valid match:
Code:
import re

# Group 1: feet (integer). Group 2: inches (integer or decimal).
r = re.compile(r"([0-9]+)' ([0-9]*\.?[0-9]+)\"")

def get_inches(el):
    m = r.match(el)
    if m is None:
        # No valid height found
        return float('NaN')
    else:
        return int(m.group(1))*12 + float(m.group(2))
Example:
>>> get_inches("7' 5.5\"")
89.5
You can apply this function to the elements in your data. That said, the other answer's approach of mapping a hand-written parser over the data also works well; I thought you might want to see how your original regex idea could be made to work.
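As one possible way to run the same pattern over whole columns, here is a sketch using pandas' Series.str.extract. The dataframe is rebuilt by hand from the question, and the helper name column_to_inches is made up for this example. Missing values simply stay NaN because the pattern finds nothing to extract.

```python
import pandas as pd

# Sample data copied from the question; None stands in for the missing value.
df = pd.DataFrame({
    "HeightNoShoes": ["7' 5.5\"", "6' 11\"", "6' 7.75\"", "6' 5.5\"", "5' 11\""],
    "HeightShoes": [None, "7' 0.25\"", "6' 9\"", "6' 6.75\"", "6' 0\""],
})

PATTERN = r"([0-9]+)' ([0-9]*\.?[0-9]+)\""

def column_to_inches(col):
    # str.extract returns one column per capture group: 0 = feet, 1 = inches.
    parts = col.str.extract(PATTERN).astype(float)
    return parts[0] * 12 + parts[1]

# DataFrame.apply calls the helper once per column.
inches = df.apply(column_to_inches)
print(inches)
```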
One possible method without using regex is to write your own function and just apply it to the column/Series of your choosing.
Code:
import pandas as pd

df = pd.read_csv("test.csv")

def parse_ht(ht):
    # format: 7' 0.0"
    ht_ = ht.split("' ")
    ft_ = float(ht_[0])
    in_ = float(ht_[1].replace("\"",""))
    return (12*ft_) + in_

print(df["HeightNoShoes"].apply(parse_ht))
Output:
0 89.50
1 83.00
2 79.75
3 77.50
4 71.00
Name: HeightNoShoes, dtype: float64
Not perfectly elegant, but it does the job with minimal fuss. Best of all, it's easy to tweak and understand.
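One tweak you may need: parse_ht as written expects a string, so it will raise on the NaN in the HeightShoes column, because a float has no split method. A minimal sketch of a NaN-tolerant variant follows; the name parse_ht_safe is hypothetical and not part of the code above.

```python
import math

def parse_ht_safe(ht):
    # Anything that is not a string (e.g. a float NaN) is passed through as NaN.
    if not isinstance(ht, str):
        return math.nan
    feet, inches = ht.split("' ")
    # Strip the trailing double quote before converting the inches part.
    return 12 * float(feet) + float(inches.rstrip('"'))

values = ["7' 5.5\"", float("nan"), "6' 0.25\""]
print([parse_ht_safe(v) for v in values])
```

It can then be applied to either column the same way, for example df["HeightShoes"].apply(parse_ht_safe).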
Comparison versus the accepted solution:
In [9]: import re

In [10]: r = re.compile(r"([0-9]+)' ([0-9]*\.?[0-9]+)\"")
    ...: def get_inches2(el):
    ...:     m = r.match(el)
    ...:     if m is None:
    ...:         return float('NaN')
    ...:     else:
    ...:         return int(m.group(1))*12 + float(m.group(2))
    ...:

In [11]: %timeit get_inches2("7' 5.5\"")
100000 loops, best of 3: 3.51 µs per loop

In [12]: %timeit parse_ht("7' 5.5\"")
1000000 loops, best of 3: 1.24 µs per loop
parse_ht is roughly 2.8 times as fast (1.24 µs versus 3.51 µs per call).
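If you want to repeat the comparison outside IPython, something along these lines should work with the standard timeit module. This is a sketch, not the original benchmark: both functions are re-declared so the script is self-contained, the lambda wrapper adds a small constant overhead to each call, and the absolute numbers will vary by machine and Python version.

```python
import re
import timeit

HEIGHT_RE = re.compile(r"([0-9]+)' ([0-9]*\.?[0-9]+)\"")

def get_inches(el):
    # Regex-based version
    m = HEIGHT_RE.match(el)
    if m is None:
        return float("nan")
    return int(m.group(1)) * 12 + float(m.group(2))

def parse_ht(ht):
    # split/replace-based version
    ht_ = ht.split("' ")
    return 12 * float(ht_[0]) + float(ht_[1].replace('"', ""))

sample = "7' 5.5\""
runs = 100_000
for func in (get_inches, parse_ht):
    seconds = timeit.timeit(lambda: func(sample), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.2f} µs per call")
```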