I have 3 <code>DataFrames</code> that have differing numbers of shared indexes. For example: <pre class="prettyprint"><code>>>> df0=pd.DataFrame(index=pd.MultiIndex.from_product([[1,2,3,4],[2011,2012],['A','B']], names=['Season','Year','Location'])) >>> df0['Value0']=np.random.randint(1,100,len(df0)) >>> >>> df1=pd.DataFrame(index=pd.MultiIndex.from_product([[2011,2012],['A','B']], names=['Year','Location'])) >>> df1['Value1']=np.random.randint(1,100,len(df1)) >>> >>> df2=pd.DataFrame(index=['A','B']) >>> df2.index.name='Location' >>> df2['Value2']=np.random.randint(1,100,len(df2)) >>> df0 Value0 Season Year Location 1 2011 A 18 B 63 2012 A 88 B 30 2 2011 A 35 B 60 2012 A 61 B 4 3 2011 A 70 B 9 2012 A 11 B 38 4 2011 A 68 B 57 2012 A 13 B 35 >>> df1 Value1 Year Location 2011 A 22 B 74 2012 A 73 B 44 >>> df2 Value2 Location A 70 B 85 >>> </code></pre> I am looking for the best way to join them on their common indexes. Things I have tried: 1) <code>pd.concat([df0,df1,df2],1)</code> would be nice because it accepts a list of dataframes, but this appears to only work if the dataframes have the same number of indexes. 2) Joining either of the multiindex dataframes with the single index <code>DataFrame</code> works: <code>df1.join(df2)</code> or <code>df0.join(df2)</code>. However joining the <code>DataFrame</code> with two indexes with the <code>DataFrame</code> with three indexes does not: <code>df0.join(df1)</code> and gives me the following error: "NotImplementedError: merging with more than one level overlap on a multi-index is not implemented" At this point, the approach I have been taking is to reset the indexes and use <code>pd.merge()</code>. See below: <pre class="prettyprint"><code>def JoinMulti(DFList): FinalDF=DFList[0].reset_index() for OtherDF in DFList[1:]: FinalDF=pd.merge(FinalDF, OtherDF.reset_index(), how='outer') #Now I want reindex it so that it's indexed the same as the `DataFrame` with the highest number of levels NLevels=[x.index.nlevels for x in DFList] MaxIndexPos=NLevels.index(max(NLevels)) FinalIndex=DFList[MaxIndexPos].index FinalDF=FinalDF.set_index(FinalIndex.names).reindex(FinalIndex) return FinalDF >>> JoinMulti([df0,df1,df2]) Value0 Value1 Value2 Season Year Location 1 2011 A 43 5 96 B 63 46 97 2012 A 68 6 96 B 23 99 97 2 2011 A 66 5 96 B 30 46 97 2012 A 45 6 96 B 79 99 97 3 2011 A 66 5 96 B 21 46 97 2012 A 86 6 96 B 11 99 97 4 2011 A 95 5 96 B 58 46 97 2012 A 32 6 96 B 80 99 97 >>> </code></pre> Is this a reasonable approach? Are there any improvements that could be made or any exceptions that I'm forgetting about?

I modified solution by Stefan Jansen: <pre class="prettyprint"><code>def jez(df0,df1,df2): df1 = df1.join(df2) df0 = df0.reset_index('Season') FinalDF = df0.join(df1).set_index('Season', append=True).reorder_levels(['Season', 'Year', 'Location']).sortlevel() return FinalDF print jez(df0,df1,df2) </code></pre> Timing: <pre class="prettyprint"><code>In [41]: %timeit jez(df0,df1,df2) The slowest run took 4.14 times longer than the fastest. This could mean that an intermediate result is being cached 100 loops, best of 3: 5.02 ms per loop In [42]: %timeit JoinMulti([df0,df1,df2]) 100 loops, best of 3: 9.83 ms per loop </code></pre>

Joining Pandas Dataframes on shared indexes

Tags:

python

pandas

dataframe

I have 3 DataFrames that have differing numbers of shared indexes. For example:

>>> df0=pd.DataFrame(index=pd.MultiIndex.from_product([[1,2,3,4],[2011,2012],['A','B']], names=['Season','Year','Location']))
>>> df0['Value0']=np.random.randint(1,100,len(df0))
>>> 
>>> df1=pd.DataFrame(index=pd.MultiIndex.from_product([[2011,2012],['A','B']], names=['Year','Location']))
>>> df1['Value1']=np.random.randint(1,100,len(df1))
>>> 
>>> df2=pd.DataFrame(index=['A','B'])
>>> df2.index.name='Location'
>>> df2['Value2']=np.random.randint(1,100,len(df2))
>>> df0
                      Value0
Season Year Location        
1      2011 A             18
            B             63
       2012 A             88
            B             30
2      2011 A             35
            B             60
       2012 A             61
            B              4
3      2011 A             70
            B              9
       2012 A             11
            B             38
4      2011 A             68
            B             57
       2012 A             13
            B             35
>>> df1
               Value1
Year Location        
2011 A             22
     B             74
2012 A             73
     B             44
>>> df2
          Value2
Location        
A             70
B             85
>>>

I am looking for the best way to join them on their common indexes.

Things I have tried:

1) pd.concat([df0,df1,df2],1) would be nice because it accepts a list of dataframes, but this appears to only work if the dataframes have the same number of indexes.

2) Joining either of the multiindex dataframes with the single index DataFrame works: df1.join(df2) or df0.join(df2). However joining the DataFrame with two indexes with the DataFrame with three indexes does not: df0.join(df1) and gives me the following error: "NotImplementedError: merging with more than one level overlap on a multi-index is not implemented"

At this point, the approach I have been taking is to reset the indexes and use pd.merge(). See below:

def JoinMulti(DFList):
    FinalDF=DFList[0].reset_index()
    for OtherDF in DFList[1:]:
        FinalDF=pd.merge(FinalDF, OtherDF.reset_index(), how='outer')

    #Now I want reindex it so that it's indexed the same as the `DataFrame` with the highest number of levels
    NLevels=[x.index.nlevels for x in DFList]
    MaxIndexPos=NLevels.index(max(NLevels))
    FinalIndex=DFList[MaxIndexPos].index
    FinalDF=FinalDF.set_index(FinalIndex.names).reindex(FinalIndex)
    return FinalDF

>>> JoinMulti([df0,df1,df2])
                      Value0  Value1  Value2
Season Year Location                        
1      2011 A             43       5      96
            B             63      46      97
       2012 A             68       6      96
            B             23      99      97
2      2011 A             66       5      96
            B             30      46      97
       2012 A             45       6      96
            B             79      99      97
3      2011 A             66       5      96
            B             21      46      97
       2012 A             86       6      96
            B             11      99      97
4      2011 A             95       5      96
            B             58      46      97
       2012 A             32       6      96
            B             80      99      97
>>>

Is this a reasonable approach? Are there any improvements that could be made or any exceptions that I'm forgetting about?

965

asked Dec 03 '15 18:12

AJG519

1 Answers

I modified solution by Stefan Jansen:

def jez(df0,df1,df2):
    df1 = df1.join(df2)
    df0 = df0.reset_index('Season')
    FinalDF = df0.join(df1).set_index('Season', append=True).reorder_levels(['Season', 'Year', 'Location']).sortlevel()
    return FinalDF

print jez(df0,df1,df2)

Timing:

In [41]: %timeit jez(df0,df1,df2)
The slowest run took 4.14 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 5.02 ms per loop

In [42]: %timeit JoinMulti([df0,df1,df2])
100 loops, best of 3: 9.83 ms per loop

answered Oct 05 '22 19:10

jezrael

Related questions
                            
                                Python : sklearn svm, providing a custom loss function
                            
                                Differentiating an equation
                            
                                Flask: Set header on static files
                            
                                Downloading images with gevent
                            
                                Scrapy: 503 Error when scraping site using CloudFlare
                            
                                Fetching the first image from a website that belongs to the post
                            
                                How to scrape the data from Wikipedia by category?
                            
                                Method replacement at runtime not updating Private attributes
                            
                                python celery - how to add CELERYBEAT_SCHEDULE task at runtime to a worker?
                            
                                How to plot certain rows of a pandas dataframe
                            
                                Pandas quantile function very slow
                            
                                PyMySQL executemany with ON DUPLICATE
                            
                                Logging in a Framework
                            
                                How to find all elements containing the word "download" using Selenium x-path?
                            
                                GitPython Is it possible to get file from specified commit without checkout
                            
                                Create a Proxy using Python to pass NTLM credentials to another Proxy
                            
                                Python Import Star Creating Hidden Namespace?
                            
                                Override data property of serializer in DRF
                            
                                how to change feature weight when training a model with sklearn?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With