Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Rename duplicated index values pandas DataFrame

Tags:

python

pandas

I have a DataFrame that contains some duplicated index values:

df1 =  pd.DataFrame( np.random.randn(6,6), columns = pd.date_range('1/1/2010', periods=6), index = {"A", "B", "C", "D", "E", "F"})
df1.rename(index = {"C": "A", "B": "E"}, inplace = 1)

ipdb> df1
      2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
 A   -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
 A   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
 E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
 F    1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
 D    0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
 E   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560

I would like to change only the name of the duplicated values and to obtain a DataFrame like the following one:

ipdb> df1
     2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
A   -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
A_dp   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
F    1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
D    0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
E_dp   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560

My approach:

(i) Create dictionary with new names

old_names = df1[df1.index.duplicated()].index.values
new_names = df1[df1.index.duplicated()].index.values + "_dp"
dictionary = dict(zip(old_names, new_names))

(ii) Rename only the duplicated values

df1.loc[df1.index.duplicated(),:].rename(index = dictionary, inplace = True)

However this does not seem to work.

like image 863
Miquel Avatar asked Mar 29 '17 14:03

Miquel


2 Answers

You can use Index.where:

df1.index = df1.index.where(~df1.index.duplicated(), df1.index + '_dp')
print (df1)
      2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
A      -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
A_dp   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
E      -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
F       1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
D       0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
E_dp   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560

And if need remove of duplicated index to unique:

print (df1)
   2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
A   -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
A   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
F    1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
D    0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
E   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560

df1.index = df1.index + df1.groupby(level=0).cumcount().astype(str).replace('0','')
print (df1)
    2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
A    -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
A1   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
E    -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
E1   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
E2   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
F     1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
D     0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
E3   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560
like image 140
jezrael Avatar answered Nov 12 '22 11:11

jezrael


I used jezrael's great answer in this renaming function:

def rn(df, suffix = '-duplicate-'):
    appendents = (suffix + df.groupby(level=0).cumcount().astype(str).replace('0','')).replace(suffix, '')
    return df.set_index(df.index + appendents)

then this:

df = pd.DataFrame({'a':[1,2,3,4,5,6,7,8, 9]}, index=['a'+str(i) for i in [1,2,3,3,4,3,5,5, 6]])
rn(df)

spits out this:

    a
a1  1
a2  2
a3  3
a3-duplicate-1  4
a4  5
a3-duplicate-2  6
a5  7
a5-duplicate-1  8
a6  9
like image 26
kotrfa Avatar answered Nov 12 '22 12:11

kotrfa