I have a data frame that looks like the following.
I was wondering if there is a fast way, in pandas, to create a Python dict that would hold data like the following:
table = {2: [4, 5, 6, 7, 8 ...], 4: [1, 2, 3, 4, ...]}
Here the keys are user IDs and the values are lists of unique dates.
This is easy to do in core Python, but I was wondering if there is a pandas- or NumPy-based method to compute this quickly. I need a solution that stays fast as the data frame grows bigger.
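For reference, a core-Python baseline of the kind I mean might look like this (a sketch only; it assumes user_id and date are the first two levels of the frame's MultiIndex, as the snippets below do):
# Plain-Python baseline: build {user_id: [unique dates]} from the index
# tuples. The membership test makes this quadratic in dates per user,
# which is why a vectorized pandas/NumPy approach is attractive.
table = {}
for uid, date, *_ in df.index:
    dates = table.setdefault(uid, [])
    if date not in dates:
        dates.append(date)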
Edit 1: Performance
Time taken: 14.3 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)})
table = levels.drop_duplicates()\
              .groupby(0)[1].apply(list)\
              .to_dict()
print(table)
Time taken: 17.4 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
res.reset_index().drop_duplicates(['user_id','date']).groupby('user_id')['date'].apply(list).to_dict()
Time taken: 294 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
a = {k: list(pd.unique(list(zip(*g))[1]))
     for k, g in groupby(df.index.values.tolist(), itemgetter(0))}
print(a)
Time taken: 15 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
pd.Series(res.index.get_level_values(1), index=res.index.get_level_values(0)).groupby(level=0).apply(set).to_dict()
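All of the numbers above are %timeit-style measurements. A rough standalone equivalent for timing any one candidate outside IPython (build_table is a hypothetical wrapper around the first snippet; df is the frame from the Setup in the answer below):
import timeit
import pandas as pd

def build_table(df):
    # first candidate from above, wrapped so it can be timed as a whole
    levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)})
    return levels.drop_duplicates().groupby(0)[1].apply(list).to_dict()

print(timeit.timeit(lambda: build_table(df), number=100) / 100, 's per loop')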
Edit 2: Benchmarking again
Wrong Result
from collections import defaultdict

idx = df.index.droplevel(-1).drop_duplicates()
l1, l2 = idx.levels
mapping = defaultdict(list)
for i, j in zip(l1, l2):
    mapping[i].append(j)
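The reason this is wrong: idx.levels holds the unique values of each level, not the row-wise pairs, so zip(l1, l2) matches unrelated keys and values. A sketch that pairs them correctly iterates over the deduplicated index tuples themselves:
from collections import defaultdict

# Each entry of the 2-level index is a row-aligned (key, value) tuple,
# so iterating the index directly keeps keys and values paired.
idx = df.index.droplevel(-1).drop_duplicates()
mapping = defaultdict(list)
for i, j in idx:
    mapping[i].append(j)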
Improved Timing: 14.6 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = {k: list(set(list(zip(*g))[1]))
     for k, g in groupby(res.index.values.tolist(), itemgetter(0))}
Here's one solution using drop_duplicates + groupby.
levels = pd.DataFrame({k: df.index.get_level_values(k) for k in range(2)})
table = levels.drop_duplicates()\
              .groupby(0)[1].apply(list)\
              .to_dict()
print(table)
{1: [2, 3], 2: [8, 9]}
Setup
import pandas as pd

df = pd.DataFrame([[1, 2, 0, 3], [1, 2, 1, 4], [1, 3, 1, 5],
                   [2, 8, 1, 3], [2, 8, 1, 4], [2, 9, 2, 5]],
                  columns=['col1', 'col2', 'col3', 'col4'])
df = df.set_index(['col1', 'col2', 'col3'])
print(df)
                col4
col1 col2 col3
1    2    0        3
          1        4
     3    1        5
2    8    1        3
          1        4
     9    2        5
Data from Jz:
pd.Series(df.index.get_level_values(0), index=df.index.get_level_values(1)).groupby(level=0).apply(set).to_dict()
Out[92]: {4: {'a', 'b'}, 5: {'a', 'b'}}
If you just need a list, you can add apply(list).
PS: Personally I do not think this step is needed.
pd.Series(df.index.get_level_values(0), index=df.index.get_level_values(1)).groupby(level=0).apply(set).apply(list).to_dict()
Out[93]: {4: ['b', 'a'], 5: ['b', 'a']}
I think if you need better performance, use itertools.groupby with pd.unique to return lists in the same ordering as the original data. If order is not important, use set:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')}).set_index(['F','B', 'A'])
print(df)
       C  D  E
F B A
a 4 a  7  1  5
  5 b  8  3  3
  4 c  9  5  6
b 5 d  4  7  9
    e  2  1  2
  4 f  3  0  4
from itertools import groupby
from operator import itemgetter
a = {k: list(pd.unique(list(zip(*g))[1]))
     for k, g in groupby(df.index.values.tolist(), itemgetter(0))}
print(a)
{'a': [4, 5], 'b': [5, 4]}
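One caveat: itertools.groupby merges only consecutive equal keys, so this assumes the index is already grouped by its first level. If that may not hold, sort first (a sketch; sorting adds O(n log n) work):
from itertools import groupby
from operator import itemgetter
import pandas as pd

# Sorting guarantees equal first-level keys are adjacent before grouping.
a = {k: list(pd.unique(list(zip(*g))[1]))
     for k, g in groupby(df.sort_index().index.values.tolist(), itemgetter(0))}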
Another pandas solution:
d = df.reset_index().drop_duplicates(['F','B']).groupby('F')['B'].apply(list).to_dict()
print (d)
{'a': [4, 5], 'b': [5, 4]}