
Changing height (feet and inches) to an integer in Python pandas

I have a pandas dataframe that contains height information and I can't seem to figure out how to convert the somewhat unstructured information into an integer.

I figured the best approach was to use a regex. My main problem is that when I try to simplify the problem, I take the first item in the dataframe (7' 5.5") and try to run a regex on just that value, but I could not work out how to put it in a string because of the quotes. So I'm really confused about how to approach this problem.

here is my dataframe:

    HeightNoShoes   HeightShoes
0   7' 5.5"             NaN
1   6' 11"           7' 0.25"
2   6' 7.75"            6' 9"
3   6' 5.5"          6' 6.75"
4   5' 11"           6' 0"

Output should be in inches:

    HeightNoShoes   HeightShoes
0   89.5                NaN
1   83                 84.25
2   79.75               81
3   77.5              78.75
4   71                  72

My next option would be writing this to CSV and using Excel, but I would prefer to learn how to do it in Python/pandas. Any help would be greatly appreciated.

itjcms18 asked Nov 18 '14 04:11


2 Answers

The other answer is a good solution that avoids regular expressions. I am posting this in case you are curious how to approach the problem with your first idea (regexes).

Your regular-expression approach does work. To put a value such as 7' 5.5" into a Python string, escape the double quote with a backslash.

For example:

py_str = "7' 5.5\""

This, combined with a regular expression, will allow you to extract the information you need from the input data to calculate the output data. The input data consists of an integer (feet) followed by ', a space, and then a floating point number (inches). This float consists of one or more digits and then, optionally, a . and more digits. Here is a regular expression that can extract the feet and inches from the input data: ([0-9]+)' ([0-9]*\.?[0-9]+)"

The first group of the regex retrieves the feet and the second retrieves the inches. Here is an example of a function in python that returns a float, in inches, based on input data such as "7' 5.5\"", or NaN if there is no valid match:

Code:

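import re

# feet, then ', a space, then inches (integer or decimal) and a closing "
r = re.compile(r"([0-9]+)' ([0-9]*\.?[0-9]+)\"")
def get_inches(el):
    m = r.match(el)
    if m is None:
        # no valid match
        return float('NaN')
    else:
        return int(m.group(1))*12 + float(m.group(2))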

Example:

>>> get_inches("7' 5.5\"")
89.5

You could map this function over the elements of a column. That said, the other answer's plain function works well too; I thought you might want to see how to approach this using your original idea.
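As a rough sketch of applying the pattern to whole columns at once, pandas' Series.str.extract can pull both groups out in one call. The DataFrame below is retyped from the question, the ^ anchor is added to mimic re.match, and to_inches is a hypothetical helper name, not something from the original posts.

```python
import pandas as pd

# Sample data retyped from the question; None marks the missing value.
df = pd.DataFrame({
    "HeightNoShoes": ["7' 5.5\"", "6' 11\"", "6' 7.75\"", "6' 5.5\"", "5' 11\""],
    "HeightShoes": [None, "7' 0.25\"", "6' 9\"", "6' 6.75\"", "6' 0\""],
})

# Same pattern as in the answer, anchored at the start like re.match.
PATTERN = r"^([0-9]+)' ([0-9]*\.?[0-9]+)\""

def to_inches(col):
    # Hypothetical helper: column 0 holds feet, column 1 holds inches.
    # Rows that are missing or do not match come back as NaN.
    parts = col.str.extract(PATTERN)
    return parts[0].astype(float) * 12 + parts[1].astype(float)

inches = df.apply(to_inches)  # runs to_inches once per column
print(inches)
```

Because missing and non-matching rows become NaN in the extracted parts, the NaN in HeightShoes carries through to the result without a separate check.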

Duke answered Sep 24 '22 22:09


One possible method without using regex is to write your own function and just apply it to the column/Series of your choosing.

Code:

import pandas as pd

df = pd.read_csv("test.csv")
def parse_ht(ht):
    # format: 7' 0.0"
    ht_ = ht.split("' ")
    ft_ = float(ht_[0])
    in_ = float(ht_[1].replace("\"",""))
    return (12*ft_) + in_

print(df["HeightNoShoes"].apply(parse_ht))

Output:

0    89.50
1    83.00
2    79.75
3    77.50
4    71.00
Name: HeightNoShoes, dtype: float64

Not perfectly elegant, but it does the job with minimal fuss. Best of all, it's easy to tweak and understand.
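One tweak that may be worth making: the HeightShoes column in the question contains a NaN, and a missing value reaches the function as a float rather than a string, so ht.split would fail on it. Below is a minimal sketch of a tolerant variant; parse_ht_safe is a hypothetical name, and returning NaN for anything unparseable is an assumption about the desired behaviour.

```python
def parse_ht_safe(ht):
    """Return height in inches, or NaN when ht is not a string like 7' 5.5"."""
    # Missing values in a pandas column arrive as float NaN (or None), not str.
    if not isinstance(ht, str):
        return float("nan")
    feet, sep, inches = ht.partition("' ")
    if not sep:
        # The "' " separator between feet and inches was not found.
        return float("nan")
    try:
        return 12 * float(feet) + float(inches.rstrip('"'))
    except ValueError:
        # One of the two parts was not a number.
        return float("nan")

print(parse_ht_safe("6' 0.25\""))
print(parse_ht_safe(float("nan")))
```

It can be passed to apply in the same way as parse_ht, for example on the HeightShoes column.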

Comparison versus the accepted solution:

In [9]: import re

In [10]: r = re.compile(r"([0-9]+)' ([0-9]*\.?[0-9]+)\"")
    ...: def get_inches2(el):
    ...:     m = r.match(el)
    ...:     if m is None:
    ...:         return float('NaN')
    ...:     else:
    ...:         return int(m.group(1))*12 + float(m.group(2))
    ...:     

In [11]: %timeit get_inches2("7' 5.5\"")
100000 loops, best of 3: 3.51 µs per loop

In [12]: %timeit parse_ht("7' 5.5\"")
1000000 loops, best of 3: 1.24 µs per loop

In this measurement parse_ht is roughly 2.8 times as fast (1.24 µs versus 3.51 µs per call).

NullDev answered Sep 22 '22 22:09