I'm trying to count the number of duplicate values based on set of columns in a DataFrame.
Example:
print(df)
Month LSOA code Longitude Latitude Crime type
0 2015-01 E01000916 -0.106453 51.518207 Bicycle theft
1 2015-01 E01000914 -0.111497 51.518226 Burglary
2 2015-01 E01000914 -0.111497 51.518226 Burglary
3 2015-01 E01000914 -0.111497 51.518226 Other theft
4 2015-01 E01000914 -0.113767 51.517372 Theft from the person
My workaround:
counts = dict()
for i, row in df.iterrows():
    key = (
        row['Longitude'],
        row['Latitude'],
        row['Crime type']
    )
    if key in counts:
        counts[key] += 1
    else:
        counts[key] = 1
And I get the counts:
{(-0.11376700000000001, 51.517371999999995, 'Theft from the person'): 1,
(-0.111497, 51.518226, 'Burglary'): 2,
(-0.111497, 51.518226, 'Other theft'): 1,
(-0.10645299999999999, 51.518207000000004, 'Bicycle theft'): 1}
Aside from the fact that this code could be improved (feel free to comment on how), what would be the idiomatic way to do this in Pandas?
For those interested I'm working on a dataset from https://data.police.uk/
You can use groupby with size, then reset the index and rename column 0 to count:
print(df)
Month LSOA code Longitude Latitude Crime type
0 2015-01 E01000916 -0.106453 51.518207 Bicycle theft
1 2015-01 E01000914 -0.111497 51.518226 Burglary
2 2015-01 E01000914 -0.111497 51.518226 Burglary
3 2015-01 E01000914 -0.111497 51.518226 Other theft
4 2015-01 E01000914 -0.113767 51.517372 Theft from the person
df = df.groupby(['Longitude', 'Latitude', 'Crime type']).size().reset_index(name='count')
print(df)
Longitude Latitude Crime type count
0 -0.113767 51.517372 Theft from the person 1
1 -0.111497 51.518226 Burglary 2
2 -0.111497 51.518226 Other theft 1
3 -0.106453 51.518207 Bicycle theft 1
print(df['count'])
0 1
1 2
2 1
3 1
Name: count, dtype: int64
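As an aside, newer pandas (1.1+) can do the same multi-column count in a single call with DataFrame.value_counts. A minimal sketch on the sample data (Month and LSOA code columns omitted for brevity):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'Longitude': [-0.106453, -0.111497, -0.111497, -0.111497, -0.113767],
    'Latitude': [51.518207, 51.518226, 51.518226, 51.518226, 51.517372],
    'Crime type': ['Bicycle theft', 'Burglary', 'Burglary',
                   'Other theft', 'Theft from the person'],
})

# One call: counts each unique (Longitude, Latitude, Crime type)
# combination, returning a MultiIndex Series sorted by count descending.
counts = df.value_counts(subset=['Longitude', 'Latitude', 'Crime type'])
print(counts)
```

Chaining `.reset_index(name='count')` on the result gives the same flat frame as the groupby/size approach above.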
An O(n) solution is possible via collections.Counter:
from collections import Counter
c = Counter(zip(df['Longitude'], df['Latitude'], df['Crime type']))
Result:
Counter({(-0.113767, 51.517372, 'Theft from the person'): 1,
 (-0.111497, 51.518226, 'Burglary'): 2,
 (-0.111497, 51.518226, 'Other theft'): 1,
 (-0.106453, 51.518207, 'Bicycle theft'): 1})
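If you want the Counter result back as a DataFrame, one way (a sketch, rebuilding the sample frame and assuming the column names above) is to expand the (key, count) pairs into rows:

```python
from collections import Counter

import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    'Longitude': [-0.106453, -0.111497, -0.111497, -0.111497, -0.113767],
    'Latitude': [51.518207, 51.518226, 51.518226, 51.518226, 51.517372],
    'Crime type': ['Bicycle theft', 'Burglary', 'Burglary',
                   'Other theft', 'Theft from the person'],
})

# Counter over the key tuples; zip already yields tuples, so no
# list() wrapper is needed.
c = Counter(zip(df['Longitude'], df['Latitude'], df['Crime type']))

# Expand each (key, count) pair into one row of a tidy frame.
out = pd.DataFrame(
    [key + (n,) for key, n in c.items()],
    columns=['Longitude', 'Latitude', 'Crime type', 'count'],
)
print(out)
```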
You can group on Longitude and Latitude, and then use value_counts on the Crime type column.
df.groupby(['Longitude', 'Latitude'])['Crime type'].value_counts().to_frame('count')
count
Longitude Latitude Crime type
-0.113767 51.517372 Theft from the person 1
-0.111497 51.518226 Burglary 2
Other theft 1
-0.106453 51.518207 Bicycle theft 1
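A related need is attaching the duplicate count to every original row rather than collapsing the frame; groupby followed by transform('size') does that. A sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Longitude': [-0.106453, -0.111497, -0.111497, -0.111497, -0.113767],
    'Latitude': [51.518207, 51.518226, 51.518226, 51.518226, 51.517372],
    'Crime type': ['Bicycle theft', 'Burglary', 'Burglary',
                   'Other theft', 'Theft from the person'],
})

# transform('size') broadcasts each group's row count back onto the
# original index, so the frame keeps all five rows.
df['count'] = (
    df.groupby(['Longitude', 'Latitude', 'Crime type'])['Crime type']
      .transform('size')
)
print(df)
```

Here the two Burglary rows each get count 2, and every other row gets count 1.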