Efficient way to merge multiple large DataFrames

Tags: python, merge, pandas, dataframe, out-of-memory

Suppose I have 4 small DataFrames, df1, df2, df3 and df4:

import pandas as pd
from functools import reduce
import numpy as np

df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])  
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])   


df1.columns = ['name', 'id', 'price']
df2.columns = ['name', 'id', 'price']
df3.columns = ['name', 'id', 'price']    
df4.columns = ['name', 'id', 'price']   

df1 = df1.rename(columns={'price':'pricepart1'})
df2 = df2.rename(columns={'price':'pricepart2'})
df3 = df3.rename(columns={'price':'pricepart3'})
df4 = df4.rename(columns={'price':'pricepart4'})

Those are the 4 DataFrames. What I would like is the result of the code below.

# Merge dataframes
df = pd.merge(df1, df2, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df3, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df4, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')

# Fill na values with 'missing'
df = df.fillna('missing')

So I have achieved this for 4 DataFrames that don't have many rows and columns.

Basically, I want to extend the above outer merge solution to MULTIPLE (48) DataFrames of size 62245 X 3:

So I came up with this solution by building from another StackOverflow answer that used a lambda reduce:

from functools import reduce
import pandas as pd
import numpy as np
dfList = []

#To create the 48 DataFrames of size 62245 X 3
for i in range(48):

    dfList.append(pd.DataFrame(np.random.randint(0,100,size=(62245, 3)), columns=['name',  'id',  'pricepart' + str(i + 1)]))


#The solution I came up with to extend the solution to more than 3 DataFrames
df_merged = reduce(lambda  left, right: pd.merge(left, right, left_on=['name', 'id'], right_on=['name', 'id'], how='outer'), dfList).fillna('missing')

This is causing a MemoryError.

I do not know what to do to stop the kernel from dying. I've been stuck on this for two days. Some code for the EXACT merge operation I have performed that does not cause the MemoryError, or something that gives the same result, would be really appreciated.

Also, the 3 columns in the main DataFrame (NOT the reproducible 48 DataFrames in the example) are of type int64, int64 and float64, and I'd prefer them to stay that way because of the integers and floats they represent.

EDIT:

Instead of running the merge operations iteratively or using reduce with a lambda, I have done it in groups of 2. I've also changed the datatype of some columns: some did not need to be float64, so I brought them down to float16. It gets very far but still ends up throwing a MemoryError.

intermediatedfList = dfList    

tempdfList = []    

#Until I merge all the 48 frames two at a time, till it becomes size 2
while(len(intermediatedfList) != 2):

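    #If there are even number of DataFrames
    if len(intermediatedfList)%2 == 0:

        #Start each pass with an empty list; otherwise tempdfList and intermediatedfList
        #are the same object and the list grows instead of shrinking
        tempdfList = []

        #Go in steps of two
        for i in range(0, len(intermediatedfList), 2):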

            #Merge DataFrame in index i, i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name',  'id'], right_on=['name',  'id'], how='outer')
            print(df1.info(memory_usage='deep'))

            #Append it to this list
            tempdfList.append(df1)

        #After merging the DataFrames in intermediatedfList two at a time into the auxiliary list tempdfList,
        #set intermediatedfList to be equal to tempdfList, so it can continue the while loop.
        intermediatedfList = tempdfList 

    else:

        #If there are odd number of DataFrames, keep the first DataFrame out

        tempdfList = [intermediatedfList[0]]

        #Go in steps of two starting from 1 instead of 0
        for i in range(1, len(intermediatedfList), 2):

            #Merge DataFrame in index i, i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name',  'id'], right_on=['name',  'id'], how='outer')
            print(df1.info(memory_usage='deep'))
            tempdfList.append(df1)

        #After merging the DataFrames in intermediatedfList two at a time into the auxiliary list tempdfList,
        #set intermediatedfList to be equal to tempdfList, so it can continue the while loop.
        intermediatedfList = tempdfList

Is there any way I can optimize my code to avoid the MemoryError? I've even used an AWS instance with 192GB of RAM (I now owe them $7, which I could've given to one of y'all). That gets farther than what I've gotten otherwise, but it still throws a MemoryError after reducing a list of 28 DataFrames to 4.

asked Jun 16 '18 08:06

imperialgendarme

3 Answers

You may get some benefit from performing index-aligned concatenation using pd.concat. This should hopefully be faster and more memory efficient than an outer merge as well.

df_list = [df1, df2, ...]
for df in df_list:
    df.set_index(['name', 'id'], inplace=True)

df = pd.concat(df_list, axis=1) # outer join by default; pass join='inner' for an inner join
df.reset_index(inplace=True)
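
As a runnable sketch of the idea above, here it is applied to the four small frames from the question. It assumes the (name, id) pairs are unique within each frame; with repeated pairs the index alignment may fail, so that is worth checking first. The names `frames`, `indexed` and `merged` are only for illustration.

```python
import pandas as pd

cols = ['name', 'id']
frames = [
    pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]],
                 columns=cols + ['pricepart1']),
    pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]], columns=cols + ['pricepart2']),
    pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]], columns=cols + ['pricepart3']),
    pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]], columns=cols + ['pricepart4']),
]

# Move the key columns into the index so concat can align on them
indexed = [f.set_index(cols) for f in frames]

# axis=1 places the frames side by side; the default join='outer' keeps every key
merged = pd.concat(indexed, axis=1).reset_index()
print(merged)
```

Keys that are absent from a frame come out as NaN in that frame's column, which is the same information the outer merge produces before any fillna.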

Alternatively, you can replace the concat (second step) by an iterative join:

from functools import reduce
df = reduce(lambda x, y: x.join(y, how='outer'), df_list)  # join defaults to how='left'

This may or may not be better than the merge.
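
One thing worth checking before any of these approaches is whether the key pairs are unique within each frame. An outer merge pairs every matching row on the left with every matching row on the right, so repeated (name, id) pairs multiply the row count at every step. The random frames in the question draw name and id from 0 to 99, so only 10,000 distinct pairs are possible across 62,245 rows and repeats are guaranteed. The sketch below uses two small made-up frames (`left` and `right` are hypothetical names) to show the effect and one way to detect it.

```python
import pandas as pd

keys = ['name', 'id']

# Two tiny frames in which the key (1, 1) appears three times each
left = pd.DataFrame({'name': [1, 1, 1, 2], 'id': [1, 1, 1, 2],
                     'pricepart1': [10, 11, 12, 13]})
right = pd.DataFrame({'name': [1, 1, 1, 3], 'id': [1, 1, 1, 3],
                      'pricepart2': [20, 21, 22, 23]})

# Detect repeated keys up front; True means the merge will multiply rows
has_repeats = left.duplicated(subset=keys).any()
print(has_repeats)

merged = pd.merge(left, right, on=keys, how='outer')
# 3 x 3 = 9 rows for key (1, 1), plus one row each for (2, 2) and (3, 3)
print(len(merged))
```

If the keys are supposed to be unique, passing validate='one_to_one' to pd.merge makes it raise an error instead of silently expanding the result.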

answered Oct 18 '22 03:10

cs95

This seems like part of what dask dataframes were designed to do (out-of-memory operations on dataframes). See "Best way to join two large datasets in Pandas" for example code. Sorry for not copying and pasting it here, but I don't want to seem like I am trying to take credit from the answerer in the linked entry.

answered Oct 18 '22 03:10

user85779

You can try a simple for loop. The only memory optimization I have applied is downcasting to the smallest suitable integer type via pd.to_numeric.

I am also using a dictionary to store dataframes. This is good practice for holding a variable number of variables.

import pandas as pd

dfs = {}
dfs[1] = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
dfs[2] = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
dfs[3] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])  
dfs[4] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])   

df = dfs[1].copy()

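for i in range(2, max(dfs)+1):
    # Outer-merge the next frame on the two key columns; -1 marks missing values
    df = pd.merge(df, dfs[i].rename(columns={2: i+1}),
                  on=[0, 1], how='outer').fillna(-1)
    # Assign by column label so the downcast dtypes replace the float columns
    value_cols = df.columns[2:]
    df[value_cols] = df[value_cols].apply(pd.to_numeric, downcast='integer')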

print(df)

   0  1   2   3   4   5
0  a  1  10  15  -1  -1
1  a  2  20  20  -1  -1
2  b  1   4  -1  -1  -1
3  c  1   2   2  -1  -1
4  e  2  10  -1  20  20
5  d  1  -1  -1  10  10
6  f  1  -1  -1   1  15

You should not, as a rule, combine strings such as "missing" with numeric types, as this will turn the entire series into an object-dtype series. Here we use -1, but you may wish to use NaN with a float dtype instead.
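
To make the dtype point concrete, here is a small sketch comparing the two choices on a made-up column with gaps (the `prices` series and its values are illustrative, not taken from the question). It only prints the dtypes and the memory each version uses, since the exact byte counts depend on the pandas version and platform.

```python
import numpy as np
import pandas as pd

# A numeric column with gaps, like the ones an outer merge produces
prices = pd.Series([10.0, np.nan, 4.0, np.nan] * 1000)

# Filling with a string mixes text and numbers in one column
as_text = prices.fillna('missing')

# Keeping NaN lets the column stay numeric and be downcast to a smaller float
as_float = pd.to_numeric(prices, downcast='float')

print(as_text.dtype, as_text.memory_usage(deep=True))
print(as_float.dtype, as_float.memory_usage(deep=True))
```

With 48 price columns the difference adds up, so leaving the gaps as NaN until the very end is usually the cheaper option.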

answered Oct 18 '22 03:10

jpp
