I'm trying to merge two DataFrames in pandas. One of them (d1 in this example) is too big for my computer to handle, so I'm reading it with read_csv and its iterator argument.
Let's say I have two DataFrames:
d1 = pd.DataFrame({
    "col1":[1,2,3,4,5,6,7,8,9],
    "col2": [5,4,3,2,5,43,2,5,6],
    "col3": [10,10,10,10,10,4,10,10,10]}, 
    index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])
d2 = pd.DataFrame({
    "yes/no": [1,0,1,0,1,1,1,0,0]}, 
    index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])
I need to merge them so that each row captures all data for each person, so the equivalent of doing:
pd.concat((d1,d2), axis=1,join="outer")
But since I can't fit d1 into memory, I've been using read_csv. (I already processed a huge file and saved it in .csv format, so imagine that my DataFrame d1 is stored in the file test.csv.)
itera = pd.read_csv("test.csv",index_col="index",iterator=True,chunksize=2)
But when I do
for i in itera:
    d2 = pd.concat((d2,i), axis=1,join="outer")
my output is just one DataFrame stacked on top of the other, instead of the rows being lined up by name.
My output looks like this:
        col1  col2  col3   yes/no
one     NaN   NaN   NaN     1.0
two     NaN   NaN   NaN     0.0
three   NaN   NaN   NaN     1.0
four    NaN   NaN   NaN     0.0
five    NaN   NaN   NaN     1.0
six     NaN   NaN   NaN     1.0
seven   NaN   NaN   NaN     1.0
eight   NaN   NaN   NaN     0.0
nine    NaN   NaN   NaN     0.0
one     1.0   5.0  10.0     NaN
two     2.0   4.0  10.0     NaN
three   3.0   3.0  10.0     NaN
four    4.0   2.0  10.0     NaN
five    5.0   5.0  10.0     NaN
six     6.0  43.0   4.0     NaN
seven   7.0   2.0  10.0     NaN
eight   8.0   5.0  10.0     NaN
nine    9.0   6.0  10.0     NaN
Hope my question makes sense :)
We can use either pandas.merge() or DataFrame.merge() to merge DataFrames. Merging is similar to a SQL join and supports the join types inner, left, right, outer, and cross.
To merge two pandas DataFrames on a common column, use merge() and set the on parameter to the column name.
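As a rough illustration of the join types mentioned above, here is a minimal sketch. The two toy frames and the shared "name" column are made up for this example and are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy frames; "name" is an assumed shared column.
left = pd.DataFrame({"name": ["paul", "peter", "lauren"], "col1": [1, 2, 3]})
right = pd.DataFrame({"name": ["paul", "peter", "dave"], "yes/no": [1, 0, 0]})

# Inner join: keeps only the names present in both frames.
inner = pd.merge(left, right, on="name", how="inner")

# Outer join: keeps every name and fills the gaps with NaN.
outer = left.merge(right, on="name", how="outer")

print(inner)
print(outer)
```

Note that the frames in the question share an index rather than a column, so there you would more likely pass left_index=True and right_index=True, or use DataFrame.join.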
I think you are looking for the combine_first method. It updates d1 with the values from each chunk produced by the read_csv iterator.
import pandas as pd
from io import StringIO  # Python 3 location (was "from StringIO import StringIO" in Python 2)
d1 = pd.DataFrame({
    "col1":[1,2,3,4,5,6,7,8,9],
    "col2": [5,4,3,2,5,43,2,5,6],
    "col3": [10,10,10,10,10,4,10,10,10]}, 
    index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])
#d2 converted to a string to use with pd.read_csv
d2 =  StringIO("""y/n col5
paul 1 
peter 0 
lauren 1 
dave 0 
bill 1 
steve 1
old-man 1
bob 0
tim 0
""")
#For each chunk update d1 with data
for chunk in pd.read_csv(d2, sep = ' ',iterator=True,chunksize=1):
    d1 = d1.combine_first(chunk[['y/n']])
#Number formatting
d1['y/n'] = d1['y/n'].astype(int)
Which returns d1 looking like:
         col1  col2  col3  y/n
bill        5     5    10    1
bob         8     5    10    0
dave        4     2    10    0
lauren      3     3    10    1
old-man     7     2    10    1
paul        1     5    10    1
peter       2     4    10    0
steve       6    43     4    1
tim         9     6    10    0
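If it is the chunked file that is the big one, as in the question, another option may be to join the small in-memory frame onto each chunk and stack the results row-wise. This is only a sketch under assumptions: the inline CSV below stands in for test.csv, its "index" column name is taken from the question's read_csv call, and the names big_csv, small, and pieces are invented for this example.

```python
import pandas as pd
from io import StringIO

# Stand-in for the large test.csv (assumed layout: an "index" column
# followed by col1..col3).
big_csv = StringIO(
    "index,col1,col2,col3\n"
    "paul,1,5,10\n"
    "peter,2,4,10\n"
    "lauren,3,3,10\n"
    "dave,4,2,10\n"
)

# The small frame that fits in memory.
small = pd.DataFrame({"yes/no": [1, 0, 1, 0]},
                     index=["paul", "peter", "lauren", "dave"])

pieces = []
for chunk in pd.read_csv(big_csv, index_col="index", chunksize=2):
    # Left join on the index: each chunk row picks up its matching row in small.
    pieces.append(chunk.join(small, how="left"))

# Stack the joined chunks row-wise (axis=0), not column-wise.
result = pd.concat(pieces, axis=0)
print(result)
```

This still collects the whole result in memory. If that is too large as well, each joined chunk could instead be appended to an output file with to_csv(mode="a"), writing the header only for the first chunk.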