I have a dataframe like the one below, where the first column contains dates and the other columns contain data on those dates:
date k1-v1 k1-v2 k2-v1 k2-v2 k1k3-v1 k1k3-v2 k4-v1 k4-v2
0 2021-01-05 2.0 7.0 NaN NaN NaN NaN 9.0 6.0
1 2021-01-31 NaN NaN 8.0 5.0 NaN NaN 7.0 6.0
2 2021-02-15 9.0 5.0 NaN 3.0 4.0 NaN NaN NaN
3 2021-02-28 NaN 9.0 0.0 1.0 NaN NaN 8.0 8.0
4 2021-03-20 7.0 NaN NaN NaN NaN NaN NaN NaN
5 2021-03-31 NaN NaN 8.0 NaN 3.0 NaN 8.0 0.0
6 2021-04-10 NaN NaN 7.0 6.0 NaN NaN NaN 9.0
7 2021-04-30 NaN 6.0 NaN NaN NaN NaN 1.0 NaN
8 2021-05-14 8.0 NaN 3.0 3.0 4.0 NaN NaN NaN
9 2021-05-31 NaN NaN 2.0 1.0 NaN NaN NaN NaN
The columns always come in pairs: (k1-v1, k1-v2); (k2-v1, k2-v2); (k1k3-v1, k1k3-v2); and so on, N pairs. But the pair columns are not always in that order, so k1-v1 will not necessarily be followed immediately by k1-v2; the k1-v2 column will simply be somewhere in the dataframe. For simplicity, I showed them side by side.
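For instance, a small test frame with the pair columns deliberately shuffled could be built like this (a minimal sketch; the sizes, seed, and values are arbitrary):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = ['k1-v1', 'k2-v1', 'k4-v2', 'k1-v2', 'k1k3-v1', 'k2-v2', 'k4-v1', 'k1k3-v2']
data = rng.integers(0, 10, size=(10, len(cols))).astype(float)
data[rng.random(data.shape) < 0.4] = np.nan  # sprinkle in missing values
df = pd.DataFrame(data, columns=cols)
df.insert(0, 'date', pd.date_range('2021-01-05', periods=10, freq='2W'))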
I need to find the last valid data date for each pair of columns and summarize it as below:
keys v1-last v2-last
0 k1 2021-05-14 2021-04-30
1 k2 2021-05-31 2021-05-31
2 k1k3 2021-05-14 NaN
3 k4 2021-04-30 2021-04-10
So for k1-v1 the last valid data is 8.0, on date 2021-05-14; for k1-v2 it's 6.0, on 2021-04-30. The columns v1-last and v2-last in the above dataframe are then filled accordingly for k1, and similarly for the others.
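Per column, this is exactly what Series.last_valid_index returns; for example (assuming the frame above):
s = df.set_index('date')['k1-v1']
s.last_valid_index()  # '2021-05-14' (a Timestamp if 'date' is datetime)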
Currently I am doing it as below, which is not very efficient on larger datasets:
df.set_index('date', inplace=True)

# unique key prefixes, e.g. {'k1', 'k2', 'k1k3', 'k4'}
unique_cols = {col[0] for col in df.columns.str.split('-')}

summarized_data = []
for col in unique_cols:
    # take this key's (v1, v2) pair and drop rows where both are NaN
    pair_df = df.loc[:, [col + '-v1', col + '-v2']].dropna(how='all')
    v1_last_valid = pair_df.iloc[:, 0].last_valid_index()
    v2_last_valid = pair_df.iloc[:, 1].last_valid_index()
    summarized_data.append([col, v1_last_valid, v2_last_valid])

summarized_df = pd.DataFrame(summarized_data, columns=['keys', 'v1-last', 'v2-last'])
This works for now and gives me the expected result, but it takes a considerable amount of time on big datasets. Can the loop be avoided, and can this be done in a different, more efficient manner?
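(For scale: a synthetic frame like the one below, with made-up sizes, reproduces the slowdown.)
n_rows, n_keys = 1000, 500  # hypothetical sizes
cols = [f'k{i}-v{j}' for i in range(n_keys) for j in (1, 2)]
big = pd.DataFrame(np.random.rand(n_rows, len(cols)), columns=cols)
big.insert(0, 'date', pd.date_range('2021-01-01', periods=n_rows))
# running the loop above on `big` is where the time goes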
s = df.set_index('date').stack()
s = s.reset_index().drop_duplicates('level_1', keep='last')
s[['keys', 'val']] = s['level_1'].str.split('-', expand=True)
s = s.pivot(index='keys', columns='val', values='date').add_suffix('-last')
Set the index of the dataframe to date and stack to reshape:
date
2021-01-05 k1-v1 2.0
k1-v2 7.0
k4-v1 9.0
k4-v2 6.0
2021-01-31 k2-v1 8.0
k2-v2 5.0
k4-v1 7.0
k4-v2 6.0
...
2021-05-31 k2-v1 2.0
k2-v2 1.0
dtype: float64
Reset the index and drop the rows having duplicate values in level_1, keeping only the last occurrence per column label:
date level_1 0
24 2021-04-10 k4-v2 9.0
25 2021-04-30 k1-v2 6.0
26 2021-04-30 k4-v1 1.0
27 2021-05-14 k1-v1 8.0
30 2021-05-14 k1k3-v1 4.0
31 2021-05-31 k2-v1 2.0
32 2021-05-31 k2-v2 1.0
Split the strings in the level_1 column to create two additional columns, keys and val:
date level_1 0 keys val
24 2021-04-10 k4-v2 9.0 k4 v2
25 2021-04-30 k1-v2 6.0 k1 v2
26 2021-04-30 k4-v1 1.0 k4 v1
27 2021-05-14 k1-v1 8.0 k1 v1
30 2021-05-14 k1k3-v1 4.0 k1k3 v1
31 2021-05-31 k2-v1 2.0 k2 v1
32 2021-05-31 k2-v2 1.0 k2 v2
Pivot the dataframe to reshape it and add the suffix -last to the column names:
val v1-last v2-last
keys
k1 2021-05-14 2021-04-30
k1k3 2021-05-14 NaN
k2 2021-05-31 2021-05-31
k4 2021-04-30 2021-04-10
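One caveat (my note, not part of the original answer): drop_duplicates(..., keep='last') keeps the positionally last row per label, so this assumes the frame is sorted by date; if that is not guaranteed, sort first:
df = df.sort_values('date')  # so 'last' really is the latest date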
We can reverse the column names and use pd.wide_to_long, where the stubnames will be the v_j's, the identifier will be date, and the k*'s become keys in the result. Then we can groupby keys and aggregate with DataFrame.last_valid_index:
# reverse the column names, e.g. "k1-v1" -> "v1-k1"
df.columns = df.columns.str.replace(r"(\w+)-(\w+)", r"\2-\1", regex=True)

# wide to long (and then make `keys` a column with reset_index)
long = pd.wide_to_long(df, stubnames=["v1", "v2"], i="date", j="keys",
                       sep="-", suffix=r"\w+").reset_index("keys")

# get the last valid dates & add a suffix
result = (long.groupby("keys")
              .agg(pd.DataFrame.last_valid_index)
              .add_suffix("-last"))
to get
>>> result
v1-last v2-last
keys
k1 2021-05-14 2021-04-30
k1k3 2021-05-14 None
k2 2021-05-31 2021-05-31
k4 2021-04-30 2021-04-10
To make the stubnames more generic for the v_j's:
stubnames = df.columns.str.extract(r"^(\w+)-", expand=False).dropna().unique()
# Index(["v1", "v2"], dtype="object")
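And if, as in the question's expected output, keys should be an ordinary column rather than the index, a plain reset_index (my addition) finishes the job:
result = result.reset_index()  # columns: keys, v1-last, v2-last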
Rename the columns and then use wide_to_long to restructure the dataframe. Stack to remove the NaNs. Then use groupby + agg to extract the last value.
df2 = (
    pd.wide_to_long(
        df2.rename(columns=lambda x: ''.join(x.split('-')[::-1])),
        stubnames=['v2', 'v1'],
        i='date',
        j='keys',
        suffix='.*'
    )
    .stack()
    .reset_index(0)
    .groupby(level=[0, 1])
    .agg({'date': 'last'})
    .unstack(-1)
).add_suffix('-last')
df2.columns = df2.columns.droplevel()
OUTPUT:
v2-last v1-last
keys
k1 2021-04-30 2021-05-14
k1k3 NaN 2021-05-14
k2 2021-05-31 2021-05-31
k4 2021-04-10 2021-04-30
Combining ideas from @MustafaAydin and @ShubhamSharma, we could do this:
temp = df.set_index('date')
Apply pd.Series.last_valid_index:
# you could use `agg` since it is a reducer
temp = temp.apply(pd.Series.last_valid_index)
Convert the index to a MultiIndex:
temp.index = temp.index.str.split("-", expand=True)
Unstack and add a suffix:
temp.unstack().add_suffix('_last')
v1_last v2_last
k1 2021-05-14 2021-04-30
k1k3 2021-05-14 None
k2 2021-05-31 2021-05-31
k4 2021-04-30 2021-04-10
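To turn that into a regular dataframe with keys as a column, the chain can be extended with a rename and reset (my addition; note this answer uses '_last' rather than the question's '-last'):
out = temp.unstack().add_suffix('_last').rename_axis('keys').reset_index()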
First, define a function to get the required date for any given column name:
def last_date(col):
    idx = df[col].last_valid_index()
    return df.loc[idx, 'date'] if idx is not None else np.nan
Then group the keys in a dictionary, and use that to build the resulting dataframe:
keys = {}
for col in df.columns[1:]:
    key, vn = col.split('-')
    keys.setdefault(key, [None] * 2)[int(vn[-1]) - 1] = last_date(col)

pd.DataFrame.from_records([[k] + dt_list for k, dt_list in keys.items()],
                          columns=['keys', 'v1-last', 'v2-last'])
This produces the required output:
keys v1-last v2-last
0 k1 2021-05-14 2021-04-30
1 k2 2021-05-31 2021-05-31
2 k1k3 2021-05-14 NaN
3 k4 2021-04-30 2021-04-10