Counting duplicate values in Pandas DataFrame

I'm trying to count the number of duplicate values based on a set of columns in a DataFrame.

Example:

print(df)

    Month   LSOA code   Longitude   Latitude    Crime type
0   2015-01 E01000916   -0.106453   51.518207   Bicycle theft
1   2015-01 E01000914   -0.111497   51.518226   Burglary
2   2015-01 E01000914   -0.111497   51.518226   Burglary
3   2015-01 E01000914   -0.111497   51.518226   Other theft
4   2015-01 E01000914   -0.113767   51.517372   Theft from the person

My workaround:

counts = dict()
for i, row in df.iterrows():
    key = (
            row['Longitude'],
            row['Latitude'],
            row['Crime type']
        )

    if key in counts:  # dict.has_key() was removed in Python 3
        counts[key] = counts[key] + 1
    else:
        counts[key] = 1

And I get the counts:

{(-0.11376700000000001, 51.517371999999995, 'Theft from the person'): 1,
 (-0.111497, 51.518226, 'Burglary'): 2,
 (-0.111497, 51.518226, 'Other theft'): 1,
 (-0.10645299999999999, 51.518207000000004, 'Bicycle theft'): 1}

This code could certainly be improved (feel free to comment on how), but what would be the idiomatic way to do this in pandas?

For those interested, I'm working on a dataset from https://data.police.uk/

tales asked Nov 30 '15 07:11


3 Answers

You can use groupby with size to count the rows in each group, then call reset_index(name='count') to turn the result back into a DataFrame whose count column is named count.

print(df)
     Month  LSOA code  Longitude   Latitude             Crime type
0  2015-01  E01000916  -0.106453  51.518207          Bicycle theft
1  2015-01  E01000914  -0.111497  51.518226               Burglary
2  2015-01  E01000914  -0.111497  51.518226               Burglary
3  2015-01  E01000914  -0.111497  51.518226            Other theft
4  2015-01  E01000914  -0.113767  51.517372  Theft from the person

df = df.groupby(['Longitude', 'Latitude', 'Crime type']).size().reset_index(name='count')
print(df)
   Longitude   Latitude             Crime type  count
0  -0.113767  51.517372  Theft from the person      1
1  -0.111497  51.518226               Burglary      2
2  -0.111497  51.518226            Other theft      1
3  -0.106453  51.518207          Bicycle theft      1

print(df['count'])
0    1
1    2
2    1
3    1
Name: count, dtype: int64
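
As a minimal, self-contained sketch of the groupby/size approach above: the frame is rebuilt from the five sample rows in the question, and the names key_columns, counts and duplicates are only illustrative. The last step, filtering for combinations that occur more than once, is an assumption about what "duplicates" should mean here.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Month": ["2015-01"] * 5,
    "LSOA code": ["E01000916", "E01000914", "E01000914", "E01000914", "E01000914"],
    "Longitude": [-0.106453, -0.111497, -0.111497, -0.111497, -0.113767],
    "Latitude": [51.518207, 51.518226, 51.518226, 51.518226, 51.517372],
    "Crime type": ["Bicycle theft", "Burglary", "Burglary", "Other theft",
                   "Theft from the person"],
})

# Columns that together define a "duplicate" (illustrative name).
key_columns = ["Longitude", "Latitude", "Crime type"]

# size() counts the rows in each group; reset_index(name=...) turns the
# resulting Series into a DataFrame with a named count column.
counts = df.groupby(key_columns).size().reset_index(name="count")

# Keep only the combinations that occur more than once.
duplicates = counts[counts["count"] > 1]
print(duplicates)
```

Assigning the result to a new name instead of overwriting df keeps the original rows available for further analysis.
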
jezrael answered Oct 22 '22 22:10


An O(n) solution is possible via collections.Counter:

from collections import Counter

# 'Crime type' contains a space, so it needs bracket access rather than attribute access.
c = Counter(zip(df['Longitude'], df['Latitude'], df['Crime type']))

Result:

Counter({(-0.113767, 51.517372, 'Theft from the person'): 1,
         (-0.111497, 51.518226, 'Burglary'): 2,
         (-0.111497, 51.518226, 'Other theft'): 1,
         (-0.106453, 51.518207, 'Bicycle theft'): 1})
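
A runnable sketch of the Counter approach, assuming the question's column names and sample values. Only the three key columns are rebuilt here, and the repeated dictionary is a hypothetical extra step for picking out the keys seen more than once.

```python
from collections import Counter

import pandas as pd

# Only the three key columns from the question's sample are needed here.
df = pd.DataFrame({
    "Longitude": [-0.106453, -0.111497, -0.111497, -0.111497, -0.113767],
    "Latitude": [51.518207, 51.518226, 51.518226, 51.518226, 51.517372],
    "Crime type": ["Bicycle theft", "Burglary", "Burglary", "Other theft",
                   "Theft from the person"],
})

# zip() yields one (longitude, latitude, crime type) tuple per row,
# and Counter tallies identical tuples in a single pass.
c = Counter(zip(df["Longitude"], df["Latitude"], df["Crime type"]))

# Keep only the keys that were seen more than once (illustrative name).
repeated = {key: n for key, n in c.items() if n > 1}
print(repeated)
```

The result is a plain mapping rather than a DataFrame, which is convenient for lookups but needs converting back if further pandas operations are planned.
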
jpp answered Oct 22 '22 22:10


You can group on Longitude and Latitude, and then use value_counts on the Crime type column.

df.groupby(['Longitude', 'Latitude'])['Crime type'].value_counts().to_frame('count')

                                           count
Longitude Latitude  Crime type                  
-0.113767 51.517372 Theft from the person      1
-0.111497 51.518226 Burglary                   2
                    Other theft                1
-0.106453 51.518207 Bicycle theft              1
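
A related sketch, not part of the answer above: assuming pandas 1.1 or later, DataFrame.value_counts accepts a subset of columns and counts each distinct combination directly, sorted from most to least frequent. The sample frame is rebuilt from the question and the name counts is illustrative.

```python
import pandas as pd

# Key columns from the question's sample rows.
df = pd.DataFrame({
    "Longitude": [-0.106453, -0.111497, -0.111497, -0.111497, -0.113767],
    "Latitude": [51.518207, 51.518226, 51.518226, 51.518226, 51.517372],
    "Crime type": ["Bicycle theft", "Burglary", "Burglary", "Other theft",
                   "Theft from the person"],
})

# DataFrame.value_counts (assumes pandas >= 1.1) counts each distinct
# combination of the subset columns, most frequent first.
counts = (
    df.value_counts(subset=["Longitude", "Latitude", "Crime type"])
      .reset_index(name="count")
)
print(counts)
```
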
Alexander answered Oct 22 '22 22:10