I have a data frame like the one below, and I'm trying to reshape it using pivot from pandas so that I keep some values from the original rows while turning the duplicate rows into columns and renaming them. Sometimes an ID has 5 duplicate rows.
I have been trying, but I can't get it to work.
import pandas as pd
df = pd.read_csv("C:dummy")
df = df.pivot(index=["ID"], columns=["Zone","PTC"], values=["Zone","PTC"])
# Rename columns and reset the index.
df.columns = [["PTC{}","Zone{}"],.format(c) for c in df.columns]
df.reset_index(inplace=True)
# Drop duplicates
df.drop(["PTC","Zone"], axis=1, inplace=True)
Input
ID Agent OV Zone Value PTC
1 10 26 M1 10 100
2 26.5 8 M2 50 95
2 26.5 8 M1 6 5
3 4.5 6 M3 4 40
3 4.5 6 M4 6 60
4 1.2 0.8 M1 8 100
5 2 0.4 M1 6 10
5 2 0.4 M2 41 86
5 2 0.4 M4 2 4
Output
ID Agent OV Zone1 Value1 PTC1 Zone2 Value2 PTC2 Zone3 Value3 PTC3
1 10 26 M_1 10 100 0 0 0 0 0 0
2 26.5 8 M_2 50 95 M_1 6 5 0 0 0
3 4.5 6 M_3 4 40 M_4 6 60 0 0 0
4 1.2 0.8 M_1 8 100 0 0 0 0 0 0
5 2 0.4 M_1 6 10 M_2 41 86 M_4 2 4
pandas drop_duplicates() syntax: keep accepts {'first', 'last', False}, default 'first'. If 'first', all duplicate rows except the first are dropped. If 'last', all duplicate rows except the last are dropped. If False, all duplicate rows are dropped.
The pandas.DataFrame.drop_duplicates() method removes duplicate rows from a DataFrame. Duplicates can be detected on selected columns or on all columns.
The transpose() function transposes index and columns: it reflects the DataFrame over its main diagonal by writing rows as columns and vice versa. If copy is True, the underlying data is copied.
To remove duplicates based on specific column(s), use subset. To remove duplicates and keep the last occurrences, use keep='last'.
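As a quick illustration of the keep and subset options described above, here is a minimal sketch on a small made-up frame (the data is invented for the example, not taken from the question). Note that drop_duplicates() discards rows, so on its own it does not produce the wide layout the question asks for.

```python
import pandas as pd

# Made-up data for illustration: ID 2 and ID 3 each appear twice.
df = pd.DataFrame({
    "ID": [1, 2, 2, 3, 3],
    "Zone": ["M1", "M2", "M1", "M3", "M3"],
})

# Compare only the ID column; keep="first" is the default.
first = df.drop_duplicates(subset=["ID"])
# Keep the last occurrence of each ID instead.
last = df.drop_duplicates(subset=["ID"], keep="last")
# keep=False drops every row whose ID occurs more than once.
unique_only = df.drop_duplicates(subset=["ID"], keep=False)
# Without subset, whole rows are compared.
full = df.drop_duplicates()

print(first["Zone"].tolist())        # ['M1', 'M2', 'M3']
print(last["Zone"].tolist())         # ['M1', 'M1', 'M3']
print(unique_only["Zone"].tolist())  # ['M1']
print(len(full))                     # 4
```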
Use cumcount to number the rows within each group, create a MultiIndex with set_index, reshape with unstack, and finally flatten the column names:
g = df.groupby(["ID","Agent", "OV"]).cumcount().add(1)
# sort_remaining=False keeps the Zone, Value, PTC order within each number
df = df.set_index(["ID","Agent","OV", g]).unstack(fill_value=0).sort_index(axis=1, level=1, sort_remaining=False)
df.columns = ["{}{}".format(a, b) for a, b in df.columns]
df = df.reset_index()
print (df)
ID Agent OV Zone1 Value1 PTC1 Zone2 Value2 PTC2 Zone3 Value3 PTC3
0 1 10.0 26.0 M1 10 100 0 0 0 0 0 0
1 2 26.5 8.0 M2 50 95 M1 6 5 0 0 0
2 3 4.5 6.0 M3 4 40 M4 6 60 0 0 0
3 4 1.2 0.8 M1 8 100 0 0 0 0 0 0
4 5 2.0 0.4 M1 6 10 M2 41 86 M4 2 4
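To make the intermediate steps easier to follow, here is a self-contained sketch that builds the question's input inline (instead of reading the CSV) and prints the helper counter and the two-level columns before they are flattened. The names keys and wide are my own, chosen for illustration.

```python
import pandas as pd

# Input frame from the question, built inline instead of read from a file.
df = pd.DataFrame({
    "ID":    [1, 2, 2, 3, 3, 4, 5, 5, 5],
    "Agent": [10, 26.5, 26.5, 4.5, 4.5, 1.2, 2, 2, 2],
    "OV":    [26, 8, 8, 6, 6, 0.8, 0.4, 0.4, 0.4],
    "Zone":  ["M1", "M2", "M1", "M3", "M4", "M1", "M1", "M2", "M4"],
    "Value": [10, 50, 6, 4, 6, 8, 6, 41, 2],
    "PTC":   [100, 95, 5, 40, 60, 100, 10, 86, 4],
})

keys = ["ID", "Agent", "OV"]

# 1-based position of each row within its (ID, Agent, OV) group.
g = df.groupby(keys).cumcount().add(1)
print(g.tolist())  # [1, 1, 2, 1, 2, 1, 1, 2, 3]

# The counter becomes the innermost index level, then moves to the columns.
wide = df.set_index(keys + [g]).unstack()

# Columns are now (original column, counter) pairs, one row per group.
print(wide.columns.nlevels)  # 2
print(wide.shape)            # (5, 9)
```

Groups with fewer rows than the largest group get missing values in the extra columns, which is what fill_value or fillna then takes care of.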
If you want to replace missing values with 0 only in the numeric columns:
g = df.groupby(["ID","Agent","OV"]).cumcount().add(1)
df = df.set_index(["ID","Agent","OV", g]).unstack().sort_index(axis=1, level=1, sort_remaining=False)
idx = pd.IndexSlice
df.loc[:, idx[['Value','PTC']]] = df.loc[:, idx[['Value','PTC']]].fillna(0).astype(int)
df.columns = ["{}{}".format(a, b) for a, b in df.columns]
df = df.fillna('').reset_index()
print (df)
   ID  Agent    OV Zone1  Value1  PTC1 Zone2  Value2  PTC2 Zone3  Value3  PTC3
0   1   10.0  26.0    M1      10   100             0     0             0     0
1   2   26.5   8.0    M2      50    95    M1       6     5             0     0
2   3    4.5   6.0    M3       4    40    M4       6    60             0     0
3   4    1.2   0.8    M1       8   100             0     0             0     0
4   5    2.0   0.4    M1       6    10    M2      41    86    M4       2     4
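An alternative sketch of the same idea, in case the MultiIndex assignment is awkward in your pandas version: flatten the column names first, then fill each column according to its prefix. This is not the answer's exact code; the names keys, wide and order are my own, and the column order is set explicitly with reindex rather than with sort_index.

```python
import pandas as pd

# Input frame from the question, built inline.
df = pd.DataFrame({
    "ID":    [1, 2, 2, 3, 3, 4, 5, 5, 5],
    "Agent": [10, 26.5, 26.5, 4.5, 4.5, 1.2, 2, 2, 2],
    "OV":    [26, 8, 8, 6, 6, 0.8, 0.4, 0.4, 0.4],
    "Zone":  ["M1", "M2", "M1", "M3", "M4", "M1", "M1", "M2", "M4"],
    "Value": [10, 50, 6, 4, 6, 8, 6, 41, 2],
    "PTC":   [100, 95, 5, 40, 60, 100, 10, 86, 4],
})

keys = ["ID", "Agent", "OV"]
g = df.groupby(keys).cumcount().add(1)
wide = df.set_index(keys + [g]).unstack()

# Explicit column order: Zone1, Value1, PTC1, Zone2, Value2, PTC2, ...
order = [(col, n) for n in range(1, int(g.max()) + 1)
         for col in ["Zone", "Value", "PTC"]]
wide = wide.reindex(columns=pd.MultiIndex.from_tuples(order))
wide.columns = [f"{col}{n}" for col, n in wide.columns]

# Numeric columns get 0 and go back to int; Zone columns get an empty string.
for name in wide.columns:
    if name.startswith(("Value", "PTC")):
        wide[name] = wide[name].fillna(0).astype(int)
    else:
        wide[name] = wide[name].fillna("")

wide = wide.reset_index()
print(wide)
```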
You can use cumcount to create a helper key, then unstack and flatten the MultiIndex columns. (PS: you can add fillna(0) at the end; I did not add it because I do not think 0 is a correct value for Zone.)
df['New']=df.groupby(['ID','Agent','OV']).cumcount()+1
new_df=df.set_index(['ID','Agent','OV','New']).unstack('New').sort_index(axis=1, level=1, sort_remaining=False)
new_df.columns=new_df.columns.map('{0[0]}{0[1]}'.format)
new_df
Out[40]:
Zone1 Value1 PTC1 Zone2 Value2 PTC2 Zone3 Value3 PTC3
ID Agent OV
1 10.0 26.0 M1 10.0 100.0 None NaN NaN None NaN NaN
2 26.5 8.0 M2 50.0 95.0 M1 6.0 5.0 None NaN NaN
3 4.5 6.0 M3 4.0 40.0 M4 6.0 60.0 None NaN NaN
4 1.2 0.8 M1 8.0 100.0 None NaN NaN None NaN NaN
5 2.0 0.4 M1 6.0 10.0 M2 41.0 86.0 M4 2.0 4.0
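Since the question asks about pivot specifically, the same helper-key idea can also be sketched with DataFrame.pivot. This assumes pandas 1.1 or later, where index accepts a list of columns. The helper column name n is hypothetical, and values is deliberately omitted so that every remaining column is pivoted with its own dtype.

```python
import pandas as pd

# Input frame from the question, built inline.
df = pd.DataFrame({
    "ID":    [1, 2, 2, 3, 3, 4, 5, 5, 5],
    "Agent": [10, 26.5, 26.5, 4.5, 4.5, 1.2, 2, 2, 2],
    "OV":    [26, 8, 8, 6, 6, 0.8, 0.4, 0.4, 0.4],
    "Zone":  ["M1", "M2", "M1", "M3", "M4", "M1", "M1", "M2", "M4"],
    "Value": [10, 50, 6, 4, 6, 8, 6, 41, 2],
    "PTC":   [100, 95, 5, 40, 60, 100, 10, 86, 4],
})

keys = ["ID", "Agent", "OV"]

# "n" numbers the rows inside each group; it is what makes the
# (index, columns) pairs unique, which pivot requires.
wide = (
    df.assign(n=df.groupby(keys).cumcount().add(1))
      .pivot(index=keys, columns="n")
)

# Interleave per occurrence: Zone1, Value1, PTC1, Zone2, ...
order = [(col, n) for n in wide.columns.levels[1]
         for col in ["Zone", "Value", "PTC"]]
wide = wide.reindex(columns=pd.MultiIndex.from_tuples(order))
wide.columns = [f"{col}{n}" for col, n in wide.columns]
wide = wide.reset_index()

print(wide.columns.tolist())
```

The pivot attempt in the question fails for a related reason: without a per-group counter, the ID values repeat, so there is nothing to tell the duplicate rows apart.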