Here is the data -
Account_Number Dummy_Account
1050080713252 ACC0000000000001
1050223213427 ACC0000000000002
1050080713252 ACC0000000169532
1105113502309 ACC0000000123005
1100043521537 ACC0000000000004
1100045301840 ACC0000000000005
1105113502309 ACC0000000000040
Rows 1 and 3 have duplicate values in Account_Number, and so do rows 4 and 7.
I need to replace the duplicate values in Account_Number with the same values from Dummy_Account. So for 1050080713252, both rows 1 and 3 should get the same dummy value, ACC0000000000001. But instead of replacing directly, I want to keep the original mapping.
My expected output is -
Account_Number_Map Dummy_Account_Original
ACC0000000000001 ACC0000000000001
ACC0000000000002 ACC0000000000002
ACC0000000000001 ACC0000000169532
ACC0000000123005 ACC0000000123005
ACC0000000000004 ACC0000000000004
ACC0000000000005 ACC0000000000005
ACC0000000123005 ACC0000000000040
Since ACC0000000169532 is the duplicate Dummy_Account w.r.t. Account_Number, I want to create a lookup that replaces it with ACC0000000000001.
What I have tried
I started by creating a dict like this -
maps = dict(zip(df.Dummy_Account, df.Account_Number))
I figured I'd create a dict that has the original Dummy_Account values as keys and the new Dummy_Account values as values, but I am a little lost. My dataset is large, so I am also looking for optimized solutions.
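For illustration, here is a minimal sketch of the lookup I have in mind, built with a plain loop over the sample data above (the names first_dummy and lookup are just placeholders):

import pandas as pd

df = pd.DataFrame({
    'Account_Number': [1050080713252, 1050223213427, 1050080713252, 1105113502309,
                       1100043521537, 1100045301840, 1105113502309],
    'Dummy_Account': ['ACC0000000000001', 'ACC0000000000002', 'ACC0000000169532',
                      'ACC0000000123005', 'ACC0000000000004', 'ACC0000000000005',
                      'ACC0000000000040']
})

first_dummy = {}   # Account_Number -> first Dummy_Account seen for it
lookup = {}        # original Dummy_Account -> replacement Dummy_Account
for acct, dummy in zip(df.Account_Number, df.Dummy_Account):
    first_dummy.setdefault(acct, dummy)
    lookup[dummy] = first_dummy[acct]

lookup['ACC0000000169532']  # 'ACC0000000000001'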
Option 1
I'd use groupby and transform with 'first'. transform will broadcast the first encountered value across all instances of the group.
df.assign(
    Account_Number=
        df.groupby('Account_Number')
          .Dummy_Account
          .transform('first')
)
Account_Number Dummy_Account
0 ACC0000000000001 ACC0000000000001
1 ACC0000000000002 ACC0000000000002
2 ACC0000000000001 ACC0000000169532
3 ACC0000000123005 ACC0000000123005
4 ACC0000000000004 ACC0000000000004
5 ACC0000000000005 ACC0000000000005
6 ACC0000000123005 ACC0000000000040
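The question also asks to keep the original mapping. As a possible extension of the snippet above (not part of the original answer; column names follow the question's expected output), the transformed column can be paired with the untouched Dummy_Account:

out = pd.DataFrame({
    'Account_Number_Map': df.groupby('Account_Number').Dummy_Account.transform('first'),
    'Dummy_Account_Original': df.Dummy_Account
})

out matches the expected output table from the question.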
Option 2
Use NumPy's np.unique to get an index of first occurrences and an inverse. The index (idx) identifies where the first occurrence of each unique 'Account_Number' appears; I use this to slice 'Dummy_Account'. The inverse array (inv) is intended to put the unique values back into place, but instead I use it on the values that sat in those same positions in the coincident array.
u, idx, inv = np.unique(
    df.Account_Number.values,
    return_index=True,
    return_inverse=True
)

df.assign(
    Account_Number=
        df.Dummy_Account.values[idx][inv]
)
Account_Number Dummy_Account
0 ACC0000000000001 ACC0000000000001
1 ACC0000000000002 ACC0000000000002
2 ACC0000000000001 ACC0000000169532
3 ACC0000000123005 ACC0000000123005
4 ACC0000000000004 ACC0000000000004
5 ACC0000000000005 ACC0000000000005
6 ACC0000000123005 ACC0000000000040
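To make the index/inverse mechanics concrete, here is a small toy illustration (added for clarity; not from the original answer):

import numpy as np

a = np.array(['b', 'a', 'b', 'c'])
u, idx, inv = np.unique(a, return_index=True, return_inverse=True)
# u   -> ['a', 'b', 'c']   sorted unique values
# idx -> [1, 0, 3]         first position of each unique value in a
# inv -> [1, 0, 1, 2]      indices into u such that u[inv] rebuilds a
# so some_other_array[idx][inv] picks, for every row, the companion value
# that sat at the first occurrence of that row's key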
Option 3
Or use pd.factorize and pd.Series.duplicated.
This is similar to option 2, except that I let duplicated identify where the first values are. I then slice the coincident values with the resulting boolean array and put them back into place with the result of pd.factorize. f plays exactly the same role as inv from option 2.
d = ~df.Account_Number.duplicated().values
f, u = pd.factorize(df.Account_Number.values)

df.assign(
    Account_Number=
        df.Dummy_Account.values[d][f]
)
Account_Number Dummy_Account
0 ACC0000000000001 ACC0000000000001
1 ACC0000000000002 ACC0000000000002
2 ACC0000000000001 ACC0000000169532
3 ACC0000000123005 ACC0000000123005
4 ACC0000000000004 ACC0000000000004
5 ACC0000000000005 ACC0000000000005
6 ACC0000000123005 ACC0000000000040
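And a matching toy illustration of the pieces used here (added for clarity; not from the original answer):

import pandas as pd

s = pd.Series(['b', 'a', 'b', 'c'])
d = ~s.duplicated().values       # [True, True, False, True]: marks first occurrences
f, u = pd.factorize(s.values)    # f = [0, 1, 0, 2], u = ['b', 'a', 'c'] (order of appearance)
# companion[d] keeps the values sitting at the first occurrences, in order of appearance,
# and companion[d][f] broadcasts them back to every row, mirroring idx/inv in option 2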
Time Tests
Results
res.plot(loglog=True)
res.div(res.min(1), 0)
jez cold pir1 pir2 pir3
10 4.524811 3.819322 2.870916 1.000000 1.238144
30 4.833144 4.093932 3.310285 1.000000 1.382189
100 4.863337 4.048008 3.146154 1.000000 1.320060
300 5.144460 3.894850 3.157636 1.000000 1.357779
1000 4.870499 3.742524 2.348021 1.000000 1.069559
3000 5.375105 3.432398 1.852771 1.126024 1.000000
10000 7.100372 4.335100 1.890134 1.551161 1.000000
30000 7.227139 3.993985 1.530002 1.594531 1.000000
100000 8.052324 3.811728 1.380440 1.708170 1.000000
300000 8.690613 4.204664 1.539624 1.942090 1.000000
1000000 7.787494 3.668117 1.498758 2.129085 1.000000
Setup
from timeit import timeit

import numpy as np
import pandas as pd

def jez(d):
    v = d.sort_values('Account_Number')
    v['Account_Number'] = v['Dummy_Account'].mask(v.duplicated('Account_Number')).ffill()
    return v.sort_index()

def cold(d):
    m = d.drop_duplicates('Account_Number', keep='first')\
         .set_index('Account_Number')\
         .Dummy_Account
    return d.assign(Account_Number=d.Account_Number.map(m))

def pir1(d):
    return d.assign(
        Account_Number=
            d.groupby('Account_Number')
             .Dummy_Account
             .transform('first')
    )

def pir2(d):
    u, idx, inv = np.unique(
        d.Account_Number.values,
        return_index=True,
        return_inverse=True
    )
    return d.assign(
        Account_Number=
            d.Dummy_Account.values[idx][inv]
    )

def pir3(d):
    p = ~d.Account_Number.duplicated().values
    f, u = pd.factorize(d.Account_Number.values)
    return d.assign(
        Account_Number=
            d.Dummy_Account.values[p][f]
    )

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000,
           30000, 100000, 300000, 1000000],
    columns='jez cold pir1 pir2 pir3'.split(),
    dtype=float
)

np.random.seed([3, 1415])
for i in res.index:
    # random data with roughly two rows per Account_Number, so duplicates exist
    d = pd.DataFrame(dict(
        Account_Number=np.random.randint(i // 2, size=i),
        Dummy_Account=range(i)
    ))
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import {j}, d'
        res.at[i, j] = timeit(stmt, setp, number=100)
Using drop_duplicates, create a Series which you'll pass to map:
m = df.drop_duplicates('Account_Number', keep='first')\
.set_index('Account_Number')\
.Dummy_Account
df.Account_Number = df.Account_Number.map(m)
df
Account_Number Dummy_Account
0 ACC0000000000001 ACC0000000000001
1 ACC0000000000002 ACC0000000000002
2 ACC0000000000001 ACC0000000169532
3 ACC0000000123005 ACC0000000123005
4 ACC0000000000004 ACC0000000000004
5 ACC0000000000005 ACC0000000000005
6 ACC0000000123005 ACC0000000000040
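If the dict described in the question (original Dummy_Account to replacement Dummy_Account) is also needed, one possible follow-up, sketched here rather than taken from the answer, is to zip the original column with the freshly mapped one after the assignment above:

lookup = dict(zip(df.Dummy_Account, df.Account_Number))
lookup['ACC0000000169532']  # 'ACC0000000000001'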
Timings
df = pd.concat([df] * 1000000, ignore_index=True)
# jezrael's solution
%%timeit
v = df.sort_values('Account_Number')
v['Account_Number'] = v['Dummy_Account'].mask(v.duplicated('Account_Number')).ffill()
v.sort_index()
315 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# in this post
%%timeit
m = df.drop_duplicates('Account_Number', keep='first')\
.set_index('Account_Number')\
.Dummy_Account
df.Account_Number.map(m)
163 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that the performance will depend on your actual data.