Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Adding a calculated column to pandas dataframe

Tags:

python

pandas

I am completely new to Python, pandas and programming in general, and I cannot figure out the following:

I have accessed a database with the help of pandas and I have put the data from the query into a dataframe, df. One of the column contains birthdays, which can have the following forms: - 01/25/1980 (string) - 01/25 (string) - None (NoneType)

Now, I would like to add a new column to df, which stores the ages of the people in the database. So I have done the following:

def addAge(df):

    today = date.today()
    df["age"] = None
    for index, row in df.iterrows():
        if row["birthday"] != None:
            if len(row["birthday"]) == 10:
                birthday = df["birthday"]
                birthdayDate = datetime.date(int(birthday[6:]), int(birthday[:2]), int(birthday[3:5])) 
                row["age"] = today.year - birthdayDate.year - ((today.month, today.day) < (birthdayDate.month, birthdayDate.day))
        print row["birthday"], row["age"]  #this is just for testing

addAge(df)
print df

The line print row["birthday"], row["age"] correctly prints the birthdays and the ages. But when I call print df, the column age always contains "None". Could you guys explain to me what I have been doing wrong? Thanks!

like image 956
Ixi Avatar asked Oct 01 '22 05:10

Ixi


1 Answers

When you call iterrows() you are getting copies of each row and cannot assign back to the larger dataframe. In general, you should be trying to using vectorized methods, rather than iterating over the rows.

So for example in this case, to parse the 'birthday' column, you could do something like this: For the rows that have a length of 10, the string will parsed into a datetime, otherwise it will be filled with a missing value.

import numpy as np
import pandas as pd
df['birthday'] = np.where(df['birthday'].str.len() == 10, pd.to_datetime(df['birthday']), '')

To calculate the ages, you can use .apply, which applies a function over each row of a series.
So if you wrapped your age calculation in a function:

def calculate_age(birthdayDate, today):
    if pd.isnull(birthdayDate):
        return np.nan
    else:
        return today.year - birthdayDate.year - 
                ((today.month, today.day) < (birthdayDate.month, birthdayDate.day))

Then, you could calculate the age column like this:

today = date.today()
df['age'] = df['birthday'].apply(lambda x: calculate_age(x, today))
like image 56
chrisb Avatar answered Oct 03 '22 06:10

chrisb