My dataframe looks like this:
pd.DataFrame([["t1","d2","e3","r4"],
["t1","d2","e2","r4"],
["t1","d2","e1","r4"]],columns=["a","b","c","d"])
and I want:
pd.DataFrame([["t1","d2","e3","r4","e1","e2"]],
columns=["a","b","c","d","c1","c2"])
i.e. only one column's values differ across rows, and I want to create a new dataframe where a column is added for each new value observed. Is there an easy way to do this?
# Columns whose values never vary
Ucols = df.columns[df.nunique() == 1].tolist()
# Number the repeated rows within each group, then pivot that counter into columns
df_out = df.set_index(Ucols).set_index(df.groupby(Ucols).cumcount(), append=True).unstack()
df_out.columns = [f'{i}{j}' if j != 0 else f'{i}' for i,j in df_out.columns]
print(df_out.reset_index())
Output:
a b d c c1 c2
0 t1 d2 r4 e3 e2 e1
Use:
df_out = df.set_index(['a','b','d',df.groupby(['a','b','d']).cumcount()]).unstack()
df_out.columns = [f'{i}{j}' if j != 0 else f'{i}' for i,j in df_out.columns]
df_out.reset_index()
Output:
a b d c c1 c2
0 t1 d2 r4 e3 e2 e1
You can use a dictionary comprehension. For consistency, I've included integer labeling on all columns.
res = pd.DataFrame({f'{col}{idx}': val for col in df
                    for idx, val in enumerate(df[col].unique(), 1)}, index=[0])
print(res)
a1 b1 c1 c2 c3 d1
0 t1 d2 e3 e2 e1 r4
An alternative to df[col].unique() is df[col].drop_duplicates(), though the latter may incur an overhead, since it iterates a pd.Series object rather than an np.ndarray.
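A quick sketch of the difference between the two (return types only; actual timings will vary by data size):

```python
import pandas as pd

df = pd.DataFrame([["t1", "d2", "e3", "r4"],
                   ["t1", "d2", "e2", "r4"],
                   ["t1", "d2", "e1", "r4"]], columns=["a", "b", "c", "d"])

# unique() returns a NumPy array; drop_duplicates() returns a pandas Series
u = df["c"].unique()
dd = df["c"].drop_duplicates()

print(type(u))              # <class 'numpy.ndarray'>
print(type(dd))             # <class 'pandas.core.series.Series'>
print(list(u) == list(dd))  # True - same values, same order
```

Both preserve first-occurrence order, so either works in the comprehension above; unique() just avoids the Series machinery.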