Merging dataframes iteratively with pandas

Tags: python, pandas, csv

I'm trying to merge two dataframes in pandas. One of them (d1 in this example) is too big for my computer to handle, so I'm reading it with read_csv and its iterator argument.

Let's say I have two dataframes:

import pandas as pd

d1 = pd.DataFrame({
    "col1":[1,2,3,4,5,6,7,8,9],
    "col2": [5,4,3,2,5,43,2,5,6],
    "col3": [10,10,10,10,10,4,10,10,10]}, 
    index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])

d2 = pd.DataFrame({
    "yes/no": [1,0,1,0,1,1,1,0,0]}, 
    index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])

I need to merge them so that each row captures all data for each person, so the equivalent of doing:

pd.concat((d1,d2), axis=1,join="outer")

But since I can't fit d1 into memory, I've been using read_csv. (I already processed a huge file and saved it in .csv format, so imagine my dataframe d1 is contained in the file test.csv.)

itera = pd.read_csv("test.csv",index_col="index",iterator=True,chunksize=2)

But when I do

for i in itera:
    d2 = pd.concat((d2,i), axis=1,join="outer")

my output is the first dataframe with the second dataframe appended below it, instead of the two being merged row by row.

My output looks like this:

        col1  col2  col3   yes/no
one     NaN   NaN   NaN     1.0
two     NaN   NaN   NaN     0.0
three   NaN   NaN   NaN     1.0
four    NaN   NaN   NaN     0.0
five    NaN   NaN   NaN     1.0
six     NaN   NaN   NaN     1.0
seven   NaN   NaN   NaN     1.0
eight   NaN   NaN   NaN     0.0
nine    NaN   NaN   NaN     0.0
one     1.0   5.0  10.0     NaN
two     2.0   4.0  10.0     NaN
three   3.0   3.0  10.0     NaN
four    4.0   2.0  10.0     NaN
five    5.0   5.0  10.0     NaN
six     6.0  43.0   4.0     NaN
seven   7.0   2.0  10.0     NaN
eight   8.0   5.0  10.0     NaN
nine    9.0   6.0  10.0     NaN

Hope my question makes sense :)

asked Dec 05 '17 16:12

Sune Nutmeg

1 Answer

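I think you are looking for the combine_first method. It fills in d1 with the values from each chunk produced by the read_csv iterator, aligning rows on the index. (In this example d2 plays the role of the file that is read in chunks.)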

import pandas as pd
from io import StringIO

d1 = pd.DataFrame({
    "col1":[1,2,3,4,5,6,7,8,9],
    "col2": [5,4,3,2,5,43,2,5,6],
    "col3": [10,10,10,10,10,4,10,10,10]}, 
    index=["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"])


# d2 as an in-memory text buffer, for use with pd.read_csv
d2 = StringIO("""y/n col5
paul 1 
peter 0 
lauren 1 
dave 0 
bill 1 
steve 1
old-man 1
bob 0
tim 0
""")

# For each chunk, fill d1 with the chunk's data
for chunk in pd.read_csv(d2, sep=' ', iterator=True, chunksize=1):
    d1 = d1.combine_first(chunk[['y/n']])

# Rows not yet filled hold NaN, which makes y/n float; cast it back to int
d1['y/n'] = d1['y/n'].astype(int)

Which returns d1 looking like:

         col1  col2  col3  y/n
bill        5     5    10    1
bob         8     5    10    0
dave        4     2    10    0
lauren      3     3    10    1
old-man     7     2    10    1
paul        1     5    10    1
peter       2     4    10    0
steve       6    43     4    1
tim         9     6    10    0
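
If it is d1 that lives in test.csv and does not fit in memory, as in the question, the merged result will probably not fit either. One possible variation, sketched below under that assumption, is to join the small dataframe onto each chunk and append the result to an output file instead of accumulating it. The helper merge_in_chunks and the output file merged.csv are hypothetical names used only for this illustration, and the temporary test.csv written here is a stand-in for the real large file.

```python
import os
import tempfile

import pandas as pd


def merge_in_chunks(big_csv, small_df, out_csv, chunksize=2):
    """Join small_df onto big_csv chunk by chunk, appending each result to out_csv.

    Hypothetical helper for illustration; assumes the CSV has a column named "index".
    """
    first = True
    for chunk in pd.read_csv(big_csv, index_col="index", chunksize=chunksize):
        # Left join: keep every row of the chunk and add the matching columns of small_df
        merged = chunk.join(small_df, how="left")
        # Write the header only once, then append the following chunks
        merged.to_csv(out_csv, mode="w" if first else "a",
                      header=first, index_label="index")
        first = False


names = ["paul", "peter", "lauren", "dave", "bill", "steve", "old-man", "bob", "tim"]
d1 = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "col2": [5, 4, 3, 2, 5, 43, 2, 5, 6],
    "col3": [10, 10, 10, 10, 10, 4, 10, 10, 10]},
    index=names)
d2 = pd.DataFrame({"yes/no": [1, 0, 1, 0, 1, 1, 1, 0, 0]}, index=names)

# Stand-in for the large file: write d1 to a temporary test.csv
workdir = tempfile.mkdtemp()
big_csv = os.path.join(workdir, "test.csv")
out_csv = os.path.join(workdir, "merged.csv")
d1.to_csv(big_csv, index_label="index")

merge_in_chunks(big_csv, d2, out_csv)

# Read the merged file back only to inspect it; with real data this may be too large
result = pd.read_csv(out_csv, index_col="index")
print(result)
```

Note that this is a left join per chunk, so it keeps the row order of the file, but names that appear only in d2 would be dropped. If a true outer join is needed, those leftover rows would have to be handled separately after the loop.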

answered Nov 10 '22 23:11

dubbbdan
