I have a pandas data frame mydf that has two columns, and both columns are datetime datatypes: mydate and mytime. I want to add three more columns: hour, weekday, and weeknum.
    def getH(t):  # gives the hour
        return t.hour

    def getW(d):  # gives the week number
        return d.isocalendar()[1]

    def getD(d):  # gives the weekday
        return d.weekday()  # 0 for Monday, 6 for Sunday

    mydf["hour"] = mydf.apply(lambda row: getH(row["mytime"]), axis=1)
    mydf["weekday"] = mydf.apply(lambda row: getD(row["mydate"]), axis=1)
    mydf["weeknum"] = mydf.apply(lambda row: getW(row["mydate"]), axis=1)
The snippet works, but it's not computationally efficient, as it loops through the data frame at least three times. Is there a faster or more optimal way to do this, for example using zip or merge? If I create just one function that returns three elements, how should I implement this? To illustrate, the function would be:
    def getHWd(d, t):
        return t.hour, d.isocalendar()[1], d.weekday()
Using the DataFrame.insert() method, we can add a new column at a specific position in the column sequence. Although insert() takes a single column name and value as input, we can call it repeatedly to add multiple columns to the DataFrame.
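A minimal sketch of repeated insert() calls, in case it helps. The sample values are made up for illustration; only the column names come from the question. The .dt accessor is used here just to produce the values to insert.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "mydate": pd.to_datetime(["2011-01-01", "2011-01-03"]),
    "mytime": pd.to_datetime(["2011-11-14 08:30", "2011-11-15 17:45"]),
})

# insert(loc, column, value) modifies df in place and adds one column per call.
df.insert(0, "hour", df["mytime"].dt.hour)
df.insert(1, "weekday", df["mydate"].dt.weekday)  # 0 for Monday, 6 for Sunday

print(list(df.columns))
```

Note that insert() raises a ValueError if the column name already exists, unless allow_duplicates=True is passed.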
Return multiple columns from pandas apply(): the function passed to apply() can return a Series that contains the new data. Pass axis=1 so that apply() calls the function once per row of the DataFrame; the Series returned for each row are combined into multiple columns.
Using the pandas DataFrame.apply() method, you can apply a function to a single column, to all columns, or to a list of several columns.
Here's one approach to do it using one apply. Say, df is like
    In [64]: df
    Out[64]:
           mydate     mytime
    0  2011-01-01 2011-11-14
    1  2011-01-02 2011-11-15
    2  2011-01-03 2011-11-16
    3  2011-01-04 2011-11-17
    4  2011-01-05 2011-11-18
    5  2011-01-06 2011-11-19
    6  2011-01-07 2011-11-20
    7  2011-01-08 2011-11-21
    8  2011-01-09 2011-11-22
    9  2011-01-10 2011-11-23
    10 2011-01-11 2011-11-24
    11 2011-01-12 2011-11-25
We'll take the lambda function out to a separate line for readability and define it so that the Series holds the values in the same order as the target columns (hour, weekday, week number):

    In [65]: lambdafunc = lambda x: pd.Series([x['mytime'].hour, x['mydate'].weekday(), x['mydate'].isocalendar()[1]])

And, apply it and store the result to df[['hour', 'weekday', 'weeknum']]

    In [66]: df[['hour', 'weekday', 'weeknum']] = df.apply(lambdafunc, axis=1)

And, the output is like

    In [67]: df
    Out[67]:
           mydate     mytime  hour  weekday  weeknum
    0  2011-01-01 2011-11-14     0        5       52
    1  2011-01-02 2011-11-15     0        6       52
    2  2011-01-03 2011-11-16     0        0        1
    3  2011-01-04 2011-11-17     0        1        1
    4  2011-01-05 2011-11-18     0        2        1
    5  2011-01-06 2011-11-19     0        3        1
    6  2011-01-07 2011-11-20     0        4        1
    7  2011-01-08 2011-11-21     0        5        1
    8  2011-01-09 2011-11-22     0        6        1
    9  2011-01-10 2011-11-23     0        0        2
    10 2011-01-11 2011-11-24     0        1        2
    11 2011-01-12 2011-11-25     0        2        2
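Since the question asks for something faster, it may be worth noting that a row-wise apply still calls a Python function once per row. A different technique, not the one used in the answer above, is the vectorized .dt accessor. The sketch below assumes both columns have a datetime64 dtype and that pandas is version 1.1 or later (needed for dt.isocalendar()); the sample frame is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "mydate": pd.date_range("2011-01-01", periods=4, freq="D"),
    "mytime": pd.date_range("2011-11-14 09:00", periods=4, freq="D"),
})

# Each line operates on the whole column at once; no per-row Python call.
df["hour"] = df["mytime"].dt.hour
df["weekday"] = df["mydate"].dt.weekday  # 0 for Monday, 6 for Sunday
# dt.isocalendar() returns a DataFrame with year, week and day columns
# (assumes pandas >= 1.1).
df["weeknum"] = df["mydate"].dt.isocalendar().week

print(df)
```

On large frames this is usually much faster than apply with axis=1, though it is worth timing on your own data.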
To complement John Galt's answer:
Depending on the task that is performed by lambdafunc, you may experience some speedup by storing the result of apply in a new DataFrame and then joining with the original:
    # The Series order must match the column names assigned below.
    lambdafunc = lambda x: pd.Series([x['mytime'].hour,
                                      x['mydate'].weekday(),
                                      x['mydate'].isocalendar()[1]])
    newcols = df.apply(lambdafunc, axis=1)
    newcols.columns = ['hour', 'weekday', 'weeknum']
    newdf = df.join(newcols)
Even if you do not see a speed improvement, I would recommend using the join. You will be able to avoid the (always annoying) SettingWithCopyWarning that may pop up when assigning directly on the columns:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
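Coming back to the single function from the question, one way it could be wired up with zip is sketched below. The sample frame is hypothetical, and using assign() is my own choice here: it returns a new DataFrame rather than writing into the existing one, which also sidesteps the warning above.

```python
import pandas as pd

def getHWd(d, t):
    # Same return order as in the question: hour, ISO week number, weekday.
    return t.hour, d.isocalendar()[1], d.weekday()

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "mydate": pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]),
    "mytime": pd.to_datetime(["2011-11-14 09:00", "2011-11-15 09:00",
                              "2011-11-16 09:00"]),
})

# One pass over the rows; zip(*...) turns the sequence of 3-tuples
# into three tuples, one per new column.
hours, weeknums, weekdays = zip(
    *(getHWd(d, t) for d, t in zip(df["mydate"], df["mytime"]))
)

# assign() returns a new DataFrame and leaves df untouched.
newdf = df.assign(hour=list(hours), weeknum=list(weeknums),
                  weekday=list(weekdays))
print(newdf)
```

This still runs Python code for every row, so it is unlikely to beat a vectorized approach, but it does make only one pass over the data.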