Number duplicates sequentially in Pandas DataFrame

I have a Pandas DataFrame that has a column that is basically a foreign key, as below:

Index   |  f_key  |    values
  0     |    1    |     red 
  1     |    2    |     blue 
  2     |    1    |     green 
  3     |    2    |     yellow 
  4     |    3    |     orange 
  5     |    1    |     violet

What I would like is to add a column that labels the repeated foreign keys sequentially, as in "dup_number" below:

Index   | dup_number |  f_key  |    values
  0     |     1      |    1    |     red 
  1     |     1      |    2    |     blue 
  2     |     2      |    1    |     green 
  3     |     2      |    2    |     yellow 
  4     |     1      |    3    |     orange 
  5     |     3      |    1    |     violet

The rows can be reordered if needed; I just need to get the "dup_number" values in there. I wrote the following code, which works and gives me a Series that I can then add to the DataFrame. However, it is very slow (the for loop is what kills the time), and I feel it is far more complicated than necessary:

import numpy as np
import pandas as pd

df = pd.DataFrame({'f_key': [1,2,1,2,3,1], 'values': ['red', 'blue', 'green', 'yellow', 'orange', 'violet']})
df_unique = df['f_key'].drop_duplicates().reset_index(drop=True)
parts = []  # one small frame per distinct key (DataFrame.append was removed in pandas 2.0)
for n in np.arange(len(df_unique)):
    # rows sharing this key; reset_index() keeps the original row label in an 'index' column
    sub_df = df.loc[df['f_key'] == df_unique[n]].reset_index()
    dup_index = pd.DataFrame({'dup_number': sub_df.index.values + 1, 'temp_index': sub_df['index']})
    parts.append(dup_index)
dup_number = pd.concat(parts)
# restore the original row order so the Series lines up with df
dup_number = dup_number.set_index(dup_number['temp_index'].astype(int))
dup_number = dup_number['dup_number'].sort_index()

Any suggestions on faster/simpler ways to do this are appreciated!

Rick Berg asked Sep 14 '16 02:09

1 Answer

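You can use groupby() with cumcount(), which numbers the rows within each group starting from 0; adding 1 makes the numbering start at 1: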

df['dup_number'] = df.groupby(['f_key']).cumcount()+1

   f_key  values  dup_number
0      1     red           1
1      2    blue           1
2      1   green           2
3      2  yellow           2
4      3  orange           1
5      1  violet           3
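
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same approach on the data from the question. The final sort is not part of the answer above; it is an optional extra, suggested only because the question says the rows may be reordered, and the name df_sorted is purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'f_key': [1, 2, 1, 2, 3, 1],
    'values': ['red', 'blue', 'green', 'yellow', 'orange', 'violet'],
})

# cumcount() numbers the rows inside each f_key group as 0, 1, 2, ...
# in their original row order, so adding 1 gives a 1-based label.
df['dup_number'] = df.groupby('f_key').cumcount() + 1

# Optional (an assumption, not in the original answer): place repeated
# keys next to each other, ordered by their duplicate number.
df_sorted = df.sort_values(['f_key', 'dup_number'])

print(df)
print(df_sorted)
```

Because the grouping and counting are done inside pandas rather than in a Python-level loop over each key, this should avoid the per-key filtering that made the original loop slow.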
Joe T. Boka answered Sep 20 '22 10:09