I have a Pandas DataFrame that has a column that is basically a foreign key, as below:
Index | f_key | values
0 | 1 | red
1 | 2 | blue
2 | 1 | green
3 | 2 | yellow
4 | 3 | orange
5 | 1 | violet
What I would like is to add a column that labels the repeated foreign keys sequentially, as in "dup_number" below:
Index | dup_number | f_key | values
0 | 1 | 1 | red
1 | 1 | 2 | blue
2 | 2 | 1 | green
3 | 2 | 2 | yellow
4 | 1 | 3 | orange
5 | 3 | 1 | violet
The rows can be reordered if needed; I just need to get the "dup_number" keys in there. I wrote the following code, which works and gives me a Series that I can then add to the DataFrame. However, it is very slow (the for loop is what kills the time), and I feel it's far more complicated than it needs to be:
import numpy as np
import pandas as pd

df = pd.DataFrame({'f_key': [1,2,1,2,3,1], 'values': ['red', 'blue', 'green', 'yellow', 'orange', 'violet']})
df_unique = df['f_key'].drop_duplicates().reset_index(drop=True)
dup_number = pd.DataFrame(columns = ['dup_number', 'temp_index'])
for n in np.arange(len(df_unique)):
    # Rows sharing this key; reset_index() keeps the original index in an 'index' column
    sub_df = df.loc[df['f_key'] == df_unique[n]].reset_index()
    dup_index = pd.DataFrame({'dup_number': sub_df.index.values[:]+1, 'temp_index': sub_df['index']})
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead
    dup_number = pd.concat([dup_number, dup_index])
dup_number = dup_number.set_index(dup_number['temp_index'].astype(int))
dup_number = dup_number['dup_number'].sort_index()
Any suggestions on faster/simpler ways to do this are appreciated!
You can count the number of duplicate rows by counting the True values in the pandas.Series returned by duplicated(). Because True counts as 1, the sum() method gives that number directly.
The duplicated() method marks duplicated Series values as True in the resulting boolean Series. Depending on the keep parameter, it marks all occurrences except the first (the default), all except the last, or every occurrence of a duplicated value.
You can also count with the value_counts() method. On a single column it returns how many times each distinct value appears; on a whole DataFrame it counts each distinct row.
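As a rough illustration of the three marking modes described above, here is a minimal sketch that reuses the f_key column from the question. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'f_key': [1, 2, 1, 2, 3, 1]})

# Default keep='first': every occurrence after the first is marked True
dup_after_first = df['f_key'].duplicated()
# keep='last': every occurrence before the last is marked True
dup_before_last = df['f_key'].duplicated(keep='last')
# keep=False: every member of a repeated key is marked True
dup_all = df['f_key'].duplicated(keep=False)

# True counts as 1, so sum() gives the number of flagged rows
n_after_first = int(dup_after_first.sum())
n_all = int(dup_all.sum())
print(n_after_first, n_all)  # prints: 3 5
```

Note that these flags only say whether a row is a repeat; they do not number the repeats, which is what the question asks for.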
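A short sketch of value_counts() on the question's f_key column, assuming the same sample data. Mapping the counts back onto the rows (the group_size name is only illustrative) gives each key's total, not a running number.

```python
import pandas as pd

df = pd.DataFrame({'f_key': [1, 2, 1, 2, 3, 1]})

# Number of rows per distinct key, most frequent first
counts = df['f_key'].value_counts()
print(counts.to_dict())  # prints: {1: 3, 2: 2, 3: 1}

# Map each row's key to its total count (same value for every row of a key)
group_size = df['f_key'].map(counts)
print(group_size.tolist())  # prints: [3, 2, 3, 2, 1, 3]
```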
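You can use cumcount(), which numbers the rows within each group starting from 0 in order of appearance, so adding 1 gives the labels you want: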
df['dup_number'] = df.groupby(['f_key']).cumcount()+1
   f_key  values  dup_number
0      1     red           1
1      2    blue           1
2      1   green           2
3      2  yellow           2
4      3  orange           1
5      1  violet           3
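For completeness, here is a self-contained sketch of the cumcount() approach on the question's sample data. The final sort is an optional extra (the question says rows may be reordered), and df_sorted is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'f_key': [1, 2, 1, 2, 3, 1],
    'values': ['red', 'blue', 'green', 'yellow', 'orange', 'violet'],
})

# cumcount() numbers rows within each f_key group from 0, in order of appearance
df['dup_number'] = df.groupby('f_key').cumcount() + 1
print(df['dup_number'].tolist())  # prints: [1, 1, 2, 2, 1, 3]

# Optional: place repeated keys next to each other
df_sorted = df.sort_values(['f_key', 'dup_number'])
print(df_sorted)
```

This avoids the Python-level loop entirely, since the grouping and numbering happen inside pandas.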