I'm trying to merge two dataframes in pandas, but one of them (in this example d1) is too big for my computer to handle, so I'm reading it with the iterator argument of read_csv.
Let's say I have two dataframes:
d1 = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "col2": [5, 4, 3, 2, 5, 43, 2, 5, 6],
    "col3": [10, 10, 10, 10, 10, 4, 10, 10, 10]},
    index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])
d2 = pd.DataFrame({
    "yes/no": [1, 0, 1, 0, 1, 1, 1, 0, 0]},
    index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])
I need to merge them so that each row captures all the data for each person, i.e. the equivalent of doing:
pd.concat((d1, d2), axis=1, join="outer")
But since I can't fit d1 into memory, I've been using read_csv (I already processed a huge file and saved it in .csv format, so imagine my dataframe d1 is contained in the file test.csv).
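For reference, a minimal sketch of how test.csv could have been produced from d1 (assuming the index is saved under the column name "index", which is what index_col="index" below expects):
# hypothetical one-off step: dump d1 to disk with the index column labelled "index"
d1.to_csv("test.csv", index_label="index")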
itera = pd.read_csv("test.csv", index_col="index", iterator=True, chunksize=2)
But when I do
for i in itera:
    d2 = pd.concat((d2, i), axis=1, join="outer")
my output is just the two dataframes appended one below the other, rather than the rows being matched up by index. My output looks like this:
col1 col2 col3 yes/no
one NaN NaN NaN 1.0
two NaN NaN NaN 0.0
three NaN NaN NaN 1.0
four NaN NaN NaN 0.0
five NaN NaN NaN 1.0
six NaN NaN NaN 1.0
seven NaN NaN NaN 1.0
eight NaN NaN NaN 0.0
nine NaN NaN NaN 0.0
one 1.0 5.0 10.0 NaN
two 2.0 4.0 10.0 NaN
three 3.0 3.0 10.0 NaN
four 4.0 2.0 10.0 NaN
five 5.0 5.0 10.0 NaN
six 6.0 43.0 4.0 NaN
seven 7.0 2.0 10.0 NaN
eight 8.0 5.0 10.0 NaN
nine 9.0 6.0 10.0 NaN
Hope my question makes sense :)
I think you are looking for the combine_first method. It basically updates d1 with the values from each chunk of the read_csv iterator, aligning rows on the index and only filling in values that are missing in d1.
import pandas as pd
from io import StringIO
d1 = pd.DataFrame({
"col1":[1,2,3,4,5,6,7,8,9],
"col2": [5,4,3,2,5,43,2,5,6],
"col3": [10,10,10,10,10,4,10,10,10]},
index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])
# d2's data written as a CSV-style string so it can be fed to pd.read_csv
d2 = StringIO("""index y/n
paul 1
peter 0
lauren 1
dave 0
bill 1
steve 1
old-man 1
bob 0
tim 0
""")
# For each chunk, update d1 with that chunk's data
for chunk in pd.read_csv(d2, sep=" ", index_col="index", iterator=True, chunksize=1):
    d1 = d1.combine_first(chunk[['y/n']])
# Number formatting: the intermediate NaNs upcast the column to float, so cast back to int
d1['y/n'] = d1['y/n'].astype(int)
Which returns d1 looking like this (the index comes back sorted, since combine_first takes the union of the two indexes):
col1 col2 col3 y/n
bill 5 5 10 1
bob 8 5 10 0
dave 4 2 10 0
lauren 3 3 10 1
old-man 7 2 10 1
paul 1 5 10 1
peter 2 4 10 0
steve 6 43 4 1
tim 9 6 10 0
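As an aside, if even the merged result were too big to hold in memory, the same chunked idea can be pushed further: join the small frame onto each chunk and append the result to an output file, so only one chunk is ever in memory at a time. This is only a rough sketch, assuming the question's original d2 DataFrame and test.csv from above; merged.csv is a made-up output name:
import pandas as pd

first = True
for chunk in pd.read_csv("test.csv", index_col="index", chunksize=2):
    # join aligns on the index, like pd.concat(axis=1) for the rows in this chunk
    merged = chunk.join(d2, how="left")
    # write the first chunk with a header, then append the rest without one
    merged.to_csv("merged.csv", mode="w" if first else "a", header=first)
    first = False
Note this is a per-chunk left join, so rows that exist only in d2 would need separate handling.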