
Create a new column only if values differ

My DataFrame looks like this:

import pandas as pd

df = pd.DataFrame([["t1","d2","e3","r4"],
                   ["t1","d2","e2","r4"],
                   ["t1","d2","e1","r4"]], columns=["a","b","c","d"])

and I want:

pd.DataFrame([["t1","d2","e3","r4","e1","e2"]],
columns=["a","b","c","d","c1","c2"])

That is, only one column has values that differ, and I want to create a new DataFrame in which a column is added each time a new value is observed. Is there an easy way to do this?

FFL75 asked Sep 24 '18 13:09


2 Answers

Edit: To generalize to any single column whose values vary (with all other columns constant):

Ucols = df.columns[(df.nunique() == 1)].tolist()
df_out = df.set_index(Ucols).set_index(df.groupby(Ucols).cumcount(), append=True).unstack()
df_out.columns = [f'{i}{j}' if j != 0 else f'{i}' for i,j in df_out.columns]
print(df_out.reset_index())

Output:

    a   b   d   c  c1  c2
0  t1  d2  r4  e3  e2  e1
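
The generalized version relies on DataFrame.nunique() to find the columns that hold a single distinct value. As a rough sketch of what that selection does on the question's data (the names counts, const_cols and varying_cols are illustrative and not part of the answer):

```python
import pandas as pd

df = pd.DataFrame(
    [["t1", "d2", "e3", "r4"],
     ["t1", "d2", "e2", "r4"],
     ["t1", "d2", "e1", "r4"]],
    columns=["a", "b", "c", "d"],
)

# Number of distinct values in each column.
counts = df.nunique()

# Columns with exactly one distinct value become the index keys;
# anything else is a column whose values get spread across new columns.
const_cols = df.columns[counts == 1].tolist()
varying_cols = df.columns[counts > 1].tolist()

print(counts.to_dict())
print(const_cols, varying_cols)
```

If more than one column varies, the same reshaping would still run, but every varying column would be expanded, so it may be worth checking varying_cols before proceeding.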

Original Answer

Use:

df_out = df.set_index(['a','b','d',df.groupby(['a','b','d']).cumcount()]).unstack()

df_out.columns = [f'{i}{j}' if j != 0 else f'{i}' for i,j in df_out.columns]

df_out.reset_index()

Output:

    a   b   d   c  c1  c2
0  t1  d2  r4  e3  e2  e1
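
For readers who want to see what each part of the one-liner does, here is the same approach split into steps. This is only a sketch of the technique on the question's data; the intermediate names (keys, pos, indexed, wide, result) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [["t1", "d2", "e3", "r4"],
     ["t1", "d2", "e2", "r4"],
     ["t1", "d2", "e1", "r4"]],
    columns=["a", "b", "c", "d"],
)

keys = ["a", "b", "d"]  # the columns whose values are constant

# Step 1: number the rows within each group of identical key values (0, 1, 2, ...).
pos = df.groupby(keys).cumcount()

# Step 2: index by the keys plus that position; only column "c" is left as data.
indexed = df.set_index(keys + [pos])

# Step 3: pivot the innermost index level (the position) into columns,
# giving a column MultiIndex of (column name, position) pairs.
wide = indexed.unstack()

# Step 4: flatten the MultiIndex into "c", "c1", "c2" (no suffix for position 0).
wide.columns = [f"{col}{i}" if i != 0 else f"{col}" for col, i in wide.columns]
result = wide.reset_index()

print(result)
```

The cumcount() step is what makes unstack() possible: without a distinct position per row, the index (a, b, d) would contain duplicates and could not be pivoted.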
Scott Boston answered Oct 10 '22 04:10


You can use a dictionary comprehension. For consistency, I've included integer labeling on all columns.

res = pd.DataFrame({f'{col}{idx}': val for col in df
                    for idx, val in enumerate(df[col].unique(), 1)}, index=[0])

print(res)

   a1  b1  c1  c2  c3  d1
0  t1  d2  e3  e2  e1  r4

An alternative to df[col].unique() is df[col].drop_duplicates(), though the latter may incur some overhead because it iterates over a pd.Series object rather than an np.ndarray.
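
To make the comparison concrete, the sketch below builds the result both ways on the question's data and checks that they agree. The helper name widen is hypothetical and exists only to avoid repeating the comprehension; no timing claim is made here.

```python
import pandas as pd

df = pd.DataFrame(
    [["t1", "d2", "e3", "r4"],
     ["t1", "d2", "e2", "r4"],
     ["t1", "d2", "e1", "r4"]],
    columns=["a", "b", "c", "d"],
)


def widen(frame, distinct):
    """Build a one-row frame with one numbered column per distinct value.

    `distinct` is a callable that takes a Series and returns its distinct
    values in order of first appearance.
    """
    return pd.DataFrame(
        {f"{col}{idx}": val
         for col in frame
         for idx, val in enumerate(distinct(frame[col]), 1)},
        index=[0],
    )


# Variant from the answer: Series.unique().
res_unique = widen(df, lambda s: s.unique())

# Alternative: Series.drop_duplicates(), which returns a Series.
res_dedup = widen(df, lambda s: s.drop_duplicates())

print(res_unique.equals(res_dedup))
```

Both calls keep values in order of first appearance, which is why the column numbering matches between the two variants.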

jpp answered Oct 10 '22 04:10