
Merging dataframes iteratively with pandas





I'm trying to merge two dataframes in pandas. One of them (d1 in this example) is too big for my computer to hold in memory, so I'm reading it with read_csv and its iterator argument.

Let's say I have two dataframes

d1 = pd.DataFrame({
    "col1": [1,2,3,4,5,6,7,8,9],
    "col2": [5,4,3,2,5,43,2,5,6],
    "col3": [10,10,10,10,10,4,10,10,10]},
    index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])

d2 = pd.DataFrame({
    "yes/no": [1,0,1,0,1,1,1,0,0]}, 
    index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])

I need to merge them so that each row captures all data for each person, so the equivalent of doing:

pd.concat((d1,d2), axis=1,join="outer")

but since I can't fit d1 into memory, I've been using read_csv (I'm using read_csv because I already processed a huge file and saved it into .csv format, so imagine my dataframe d1 is contained in the file test.csv).

itera = pd.read_csv("test.csv",index_col="index",iterator=True,chunksize=2)
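
To reproduce this setup on a small scale, here is a minimal sketch that writes d1 to a temporary test.csv and reads it back in chunks. It assumes the index was saved under the header "index" (which is what index_col="index" expects). The temporary path and the chunk_shapes list are illustrative only, and col1 uses the values that appear in the output shown further down.

```python
import os
import tempfile

import pandas as pd

names = ["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"]
d1 = pd.DataFrame(
    {
        "col1": list(range(1, 10)),  # values as they appear in the output tables
        "col2": [5, 4, 3, 2, 5, 43, 2, 5, 6],
        "col3": [10, 10, 10, 10, 10, 4, 10, 10, 10],
    },
    index=names,
)

# Assumption: the index column is stored under the header "index".
path = os.path.join(tempfile.mkdtemp(), "test.csv")
d1.to_csv(path, index_label="index")

# Passing chunksize makes read_csv return an iterator of DataFrames,
# each holding at most `chunksize` rows of the file.
chunk_shapes = []
for chunk in pd.read_csv(path, index_col="index", chunksize=2):
    chunk_shapes.append(chunk.shape)

print(chunk_shapes)
```

Each chunk holds a different set of rows, so the chunks are row-wise slices of d1 and do not share index labels with each other.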

But when I do

for i in itera:
    d2 = pd.concat((d2,i), axis=1,join="outer")

my output is the first dataframe with the second one appended below it, instead of the two being merged row by row.

My output looks like this:

         col1  col2  col3  yes/no
paul      NaN   NaN   NaN     1.0
peter     NaN   NaN   NaN     0.0
lauren    NaN   NaN   NaN     1.0
dave      NaN   NaN   NaN     0.0
bill      NaN   NaN   NaN     1.0
steve     NaN   NaN   NaN     1.0
old-man   NaN   NaN   NaN     1.0
bob       NaN   NaN   NaN     0.0
tim       NaN   NaN   NaN     0.0
paul      1.0   5.0  10.0     NaN
peter     2.0   4.0  10.0     NaN
lauren    3.0   3.0  10.0     NaN
dave      4.0   2.0  10.0     NaN
bill      5.0   5.0  10.0     NaN
steve     6.0  43.0   4.0     NaN
old-man   7.0   2.0  10.0     NaN
bob       8.0   5.0  10.0     NaN
tim       9.0   6.0  10.0     NaN

Hope my question makes sense :)

Sune Nutmeg asked Dec 05 '17 16:12

1 Answer

I think you are looking for the combine_first method. It fills d1 with the values from each chunk produced by the read_csv iterator, aligning rows on the index.
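
As a quick illustration of what combine_first does, here is a minimal sketch on two tiny frames. The frames a and b and their labels are hypothetical and exist only for this example: the caller's non-null values are kept, its gaps are filled from the other frame, and the result covers the union of both indexes and columns.

```python
import pandas as pd

# Hypothetical frames, for illustration only
a = pd.DataFrame({"x": [1.0, None]}, index=["r1", "r2"])
b = pd.DataFrame({"x": [20.0, 30.0], "y": [5, 6]}, index=["r2", "r3"])

# Values already present in `a` win; missing values, rows and columns
# are taken from `b`.
result = a.combine_first(b)
print(result)
```

Row r1 keeps its x from a, row r2 gets its missing x from b, and row r3 and column y come entirely from b. This is why applying it chunk by chunk gradually fills in one frame.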

import pandas as pd
from io import StringIO  # Python 3 (Python 2 used: from StringIO import StringIO)

d1 = pd.DataFrame({
    "col1": [1,2,3,4,5,6,7,8,9],
    "col2": [5,4,3,2,5,43,2,5,6],
    "col3": [10,10,10,10,10,4,10,10,10]},
    index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])

# d2 converted to a string to use with pd.read_csv
d2 = StringIO("""name y/n
paul 1
peter 0
lauren 1
dave 0
bill 1
steve 1
old-man 1
bob 0
tim 0
""")

# For each chunk, update d1 with the chunk's data (rows are matched on the index)
for chunk in pd.read_csv(d2, sep=' ', index_col=0, iterator=True, chunksize=1):
    d1 = d1.combine_first(chunk[['y/n']])
# Number formatting: the column is float while it still contains NaN
d1['y/n'] = d1['y/n'].astype(int)

Which returns d1 looking like:

         col1  col2  col3  y/n
bill        5     5    10    1
bob         8     5    10    0
dave        4     2    10    0
lauren      3     3    10    1
old-man     7     2    10    1
paul        1     5    10    1
peter       2     4    10    0
steve       6    43     4    1
tim         9     6    10    0
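
In the question it is d1 that lives in test.csv and is too large for memory, while d2 is small. A minimal sketch for that direction, under a few assumptions, is to read the large file in chunks, join each chunk with the small frame on the index, and append the result to an output file so the full table is never held in memory at once. The temporary directory, the output name merged.csv, and the variable names are hypothetical, and the index column is assumed to be stored under the header "index".

```python
import os
import tempfile

import pandas as pd

names = ["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"]

# The small frame that fits in memory (d2 in the question)
small = pd.DataFrame({"yes/no": [1, 0, 1, 0, 1, 1, 1, 0, 0]}, index=names)

# Stand-in for the large file; in practice test.csv already exists on disk.
workdir = tempfile.mkdtemp()
big_path = os.path.join(workdir, "test.csv")
out_path = os.path.join(workdir, "merged.csv")  # hypothetical output name
pd.DataFrame(
    {
        "col1": list(range(1, 10)),
        "col2": [5, 4, 3, 2, 5, 43, 2, 5, 6],
        "col3": [10, 10, 10, 10, 10, 4, 10, 10, 10],
    },
    index=names,
).to_csv(big_path, index_label="index")

for n, chunk in enumerate(pd.read_csv(big_path, index_col="index", chunksize=2)):
    # Align the small frame to the rows of this chunk via the index labels
    merged = chunk.join(small, how="left")
    # Write the header with the first chunk only, then append
    merged.to_csv(out_path, mode="w" if n == 0 else "a", header=(n == 0))

# Reading the result back here only to inspect it; a real run would skip this
result = pd.read_csv(out_path, index_col="index")
print(result)
```

A left join keeps only the rows present in the large file. If the small frame may contain people who are missing from the large file, they would need to be added in a separate pass to match the outer join in the question.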
dubbbdan answered Nov 10 '22 23:11
