I have a list of dates and the goal is to count the occurrences of each date while maintaining the order in which they appear in the original list. Consider the following example:
The list only_dates looks like this:
[datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]
I am trying to use groupby:
import itertools
day_wise_counts = [(k, len(list(g))) for k, g in itertools.groupby(only_dates)]
print(str(day_wise_counts))
This prints
[(datetime.date(2017, 3, 10), 1), (datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 1), (datetime.date(2017, 3, 11), 1)]
I understand this is happening because, ultimately, each date object is being treated as a distinct one while grouping.
I was expecting the output to be:
[(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]
I am not necessarily looking for a list of tuples. A dictionary output will also suffice as long as the original order of dates is maintained (an OrderedDict, maybe).
How can I achieve this?
Update: Multiple approaches have been suggested and they all work well. But I should have mentioned that I'll be doing this operation on a large amount of data, so it would be great if your solution is optimal in terms of running time. Please edit your answer/comment accordingly if you can.
Update 2: The size of the data can be as large as 1 million rows.
Indeed, you could use an OrderedDict:
from collections import OrderedDict
import datetime

inp = [datetime.date(2017, 3, 9), datetime.date(2017, 3, 10),
       datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]

odct = OrderedDict()
for item in inp:
    try:
        odct[item] += 1
    except KeyError:
        odct[item] = 1

print(odct)
which prints:
OrderedDict([(datetime.date(2017, 3, 9), 1),
             (datetime.date(2017, 3, 10), 2),
             (datetime.date(2017, 3, 11), 1)])
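If you would rather have the list-of-tuples form from the question, the OrderedDict converts to it directly; a minimal sketch, reusing the odct built above:
# Convert the ordered counts into the list-of-tuples form from the question.
day_wise_counts = list(odct.items())
print(day_wise_counts)
# [(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]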
You also asked for timings, so here they are:
from collections import OrderedDict, Counter
import datetime
import random

# Functions
def ordereddict(inp):
    odct = OrderedDict()
    for item in inp:
        try:
            odct[item] += 1
        except KeyError:
            odct[item] = 1
    return odct

def dawg(inp):
    cnts = Counter(inp)
    seen = set()
    return [(e, cnts[e]) for e in inp if not (e in seen or seen.add(e))]

def chris1(inp):
    return [(item, inp.count(item)) for item in list(OrderedDict.fromkeys(inp))]

def chris2(inp):
    c = Counter(inp)
    return [(item, c[item]) for item in list(OrderedDict.fromkeys(inp))]

# Taken from answer: https://stackoverflow.com/a/23747652/5393381
class OrderedCounter(Counter, OrderedDict):
    'Counter that remembers the order elements are first encountered'
    def __repr__(self):
        return '%s(%r)' % (self.__class__.__name__, OrderedDict(self))
    def __reduce__(self):
        return self.__class__, (OrderedDict(self),)

# Timing setup
timings = {ordereddict: [], dawg: [], chris1: [], chris2: [], OrderedCounter: []}
sizes = [2**i for i in range(1, 20)]

# Timing
for size in sizes:
    func_input = [datetime.date(2017, random.randint(1, 12), random.randint(1, 28)) for _ in range(size)]
    for func in timings:
        res = %timeit -o func(func_input)  # if you use IPython, otherwise use the "timeit" module
        timings[func].append(res)
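The %timeit magic only works inside IPython/Jupyter. Outside of it, the inner loop over the functions could be approximated with the standard timeit module; a rough sketch, not the exact command used for the plot:
import timeit

# Rough stand-in for the inner "%timeit -o" loop when running outside IPython.
# Note: this stores plain floats (best seconds per call), so the plotting code
# below would use the stored values directly instead of "time.best".
for func in timings:
    runs = timeit.repeat(lambda: func(func_input), repeat=3, number=10)
    timings[func].append(min(runs) / 10)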
The timings were then plotted:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(1)
ax = plt.subplot(111)
for func in timings:
    ax.plot([2**i for i in range(1, 20)],
            [time.best for time in timings[func]],
            label=str(func.__name__))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()
I timed it on Python 3.5. The approaches using Counter will likely be a bit slower on Python 2.x (Counter was optimized for Python 3.x). Also, the chris2 and dawg lines overlap in the plot because there is almost no time difference between them.
So, except for the first approach of @Chris_Rands and the OrderedCounter, the approaches perform very similarly and mostly depend on the number of duplicates in your list. It's mostly a factor of 1.5-2 difference, and I couldn't find any real time difference for 1 million items between the 3 "fast" approaches.
You could use list.count() with a list comprehension iterating over a list derived from an OrderedDict of unique, ordered dates:
import datetime
from collections import OrderedDict
lst = [datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]
[(item,lst.count(item)) for item in list(OrderedDict.fromkeys(lst))]
# [(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]
Or similarly, using a collections.Counter instead of list.count:
from collections import Counter
c = Counter(lst)
[(item,c[item]) for item in list(OrderedDict.fromkeys(lst))]
# [(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]
Or use an OrderedCounter.
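For completeness, an OrderedCounter (the Counter/OrderedDict subclass shown in @MSeifert's benchmark) is used like a regular Counter and remembers first-encounter order; a minimal sketch, assuming that class definition is in scope:
# Assumes the OrderedCounter class (Counter + OrderedDict) from @MSeifert's answer.
oc = OrderedCounter(lst)
print(list(oc.items()))
# [(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]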
EDIT: see the excellent benchmark by @MSeifert.