I am completely new to Python, pandas and programming in general, and I cannot figure out the following:
I have accessed a database with the help of pandas and I have put the data from the query into a dataframe, df. One of the column contains birthdays, which can have the following forms: - 01/25/1980 (string) - 01/25 (string) - None (NoneType)
Now, I would like to add a new column to df, which stores the ages of the people in the database. So I have done the following:
def addAge(df):
today = date.today()
df["age"] = None
for index, row in df.iterrows():
if row["birthday"] != None:
if len(row["birthday"]) == 10:
birthday = df["birthday"]
birthdayDate = datetime.date(int(birthday[6:]), int(birthday[:2]), int(birthday[3:5]))
row["age"] = today.year - birthdayDate.year - ((today.month, today.day) < (birthdayDate.month, birthdayDate.day))
print row["birthday"], row["age"] #this is just for testing
addAge(df)
print df
The line print row["birthday"], row["age"] correctly prints the birthdays and the ages. But when I call print df, the column age always contains "None". Could you guys explain to me what I have been doing wrong? Thanks!
When you call iterrows()
you are getting copies of each row and cannot assign back to the larger dataframe. In general, you should be trying to using vectorized methods, rather than iterating over the rows.
So for example in this case, to parse the 'birthday' column, you could do something like this: For the rows that have a length of 10, the string will parsed into a datetime, otherwise it will be filled with a missing value.
import numpy as np
import pandas as pd
df['birthday'] = np.where(df['birthday'].str.len() == 10, pd.to_datetime(df['birthday']), '')
To calculate the ages, you can use .apply
, which applies a function over each row of a series.
So if you wrapped your age calculation in a function:
def calculate_age(birthdayDate, today):
if pd.isnull(birthdayDate):
return np.nan
else:
return today.year - birthdayDate.year -
((today.month, today.day) < (birthdayDate.month, birthdayDate.day))
Then, you could calculate the age column like this:
today = date.today()
df['age'] = df['birthday'].apply(lambda x: calculate_age(x, today))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With