I have a data frame that looks like the following.
I was wondering if there is a fast way, in pandas, to create a Python dict that would hold data like the following:
table = {2: [4, 5, 6, 7, 8 ...], 4: [1, 2, 3, 4, ...]}
Here the keys are user IDs and the values are lists of unique dates.
This is easy to do in core Python, but I was wondering if there is a pandas- or NumPy-based method to compute this quickly. I need a solution that stays fast as the data frame grows bigger.
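For reference, a core-Python baseline of the kind I mean might look like this (a sketch only; it assumes user_id and date are the first two levels of the frame's MultiIndex, as the snippets below do):
# Plain-Python baseline: build {user_id: [unique dates]} from the index
# tuples. The membership test makes this quadratic in dates per user,
# which is why a vectorized pandas/NumPy approach is attractive.
table = {}
for uid, date, *_ in df.index:
    dates = table.setdefault(uid, [])
    if date not in dates:
        dates.append(date)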
Edit 1: Performance
Time taken: 14.3 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)})
table = levels.drop_duplicates()\
              .groupby(0)[1].apply(list)\
              .to_dict()
print(table)
Time taken: 17.4 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
res.reset_index().drop_duplicates(['user_id','date']).groupby('user_id')['date'].apply(list).to_dict()
Time taken: 294 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
a = {k: list(pd.unique(list(zip(*g))[1]))
     for k, g in groupby(df.index.values.tolist(), itemgetter(0))}
print(a)
Time taken: 15 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
pd.Series(res.index.get_level_values(1), index=res.index.get_level_values(0)).groupby(level=0).apply(set).to_dict()
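All of the numbers above are %timeit-style measurements. A rough standalone equivalent for timing any one candidate outside IPython (build_table is a hypothetical wrapper around the first snippet; df is the frame from the Setup in the answer below):
import timeit
import pandas as pd

def build_table(df):
    # first candidate from above, wrapped so it can be timed as a whole
    levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)})
    return levels.drop_duplicates().groupby(0)[1].apply(list).to_dict()

print(timeit.timeit(lambda: build_table(df), number=100) / 100, 's per loop')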
Edit 2: Benchmarking again
Wrong Result
from collections import defaultdict

idx = df.index.droplevel(-1).drop_duplicates()
l1, l2 = idx.levels
mapping = defaultdict(list)
for i, j in zip(l1, l2):
    mapping[i].append(j)
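The reason this is wrong: idx.levels holds the unique values of each level, not the row-wise pairs, so zip(l1, l2) matches unrelated keys and values. A sketch that pairs them correctly iterates over the deduplicated index tuples themselves:
from collections import defaultdict

# Each entry of the 2-level index is a row-aligned (key, value) tuple,
# so iterating the index directly keeps keys and values paired.
idx = df.index.droplevel(-1).drop_duplicates()
mapping = defaultdict(list)
for i, j in idx:
    mapping[i].append(j)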
Improved Timing: 14.6 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = {k: list(set(list(zip(*g))[1]))
     for k, g in groupby(res.index.values.tolist(), itemgetter(0))}
Here's one solution using drop_duplicates + groupby.
levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)})
table = levels.drop_duplicates()\
              .groupby(0)[1].apply(list)\
              .to_dict()
print(table)
{1: [2, 3], 2: [8, 9]}
Setup
import pandas as pd

df = pd.DataFrame([[1, 2, 0, 3], [1, 2, 1, 4], [1, 3, 1, 5],
                   [2, 8, 1, 3], [2, 8, 1, 4], [2, 9, 2, 5]],
                  columns=['col1', 'col2', 'col3', 'col4'])
df = df.set_index(['col1', 'col2', 'col3'])
print(df)
                col4
col1 col2 col3
1    2    0        3
          1        4
     3    1        5
2    8    1        3
          1        4
     9    2        5
Data from Jz:
pd.Series(df.index.get_level_values(0), index=df.index.get_level_values(1)).groupby(level=0).apply(set).to_dict()
Out[92]: {4: {'a', 'b'}, 5: {'a', 'b'}}
If you just need a list, you can add apply(list).
PS: Personally I do not think this step is needed.
pd.Series(df.index.get_level_values(0), index=df.index.get_level_values(1)).groupby(level=0).apply(set).apply(list).to_dict()
Out[93]: {4: ['b', 'a'], 5: ['b', 'a']}
I think if you need better performance, use itertools.groupby with pd.unique to return lists in the same ordering as the original data. If order is not important, use set:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')}).set_index(['F','B', 'A'])
print(df)
       C  D  E
F B A
a 4 a  7  1  5
  5 b  8  3  3
  4 c  9  5  6
b 5 d  4  7  9
    e  2  1  2
  4 f  3  0  4
from itertools import groupby
from operator import itemgetter
a = {k: list(pd.unique(list(zip(*g))[1]))
     for k, g in groupby(df.index.values.tolist(), itemgetter(0))}
print(a)
{'a': [4, 5], 'b': [5, 4]}
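One caveat: itertools.groupby merges only consecutive equal keys, so this assumes the index is already grouped by its first level. If that may not hold, sort first (a sketch; sorting adds O(n log n) work):
from itertools import groupby
from operator import itemgetter
import pandas as pd

# Sorting guarantees equal first-level keys are adjacent before grouping.
a = {k: list(pd.unique(list(zip(*g))[1]))
     for k, g in groupby(df.sort_index().index.values.tolist(), itemgetter(0))}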
Another pandas solution:
d = df.reset_index().drop_duplicates(['F','B']).groupby('F')['B'].apply(list).to_dict()
print (d)
{'a': [4, 5], 'b': [5, 4]}