
Efficient way to combine pandas data frames row-wise

I have 14 data frames, each with 14 columns and more than 250,000 rows. The data frames have identical column headers, and I would like to merge them row-wise. I attempted to concatenate them into a 'growing' DataFrame, and it's taking several hours.

Essentially, I was doing something like this 13 times:

DF = pd.DataFrame()
for i in range(13):   
    DF = pd.concat([DF, subDF])

This Stack Overflow answer suggests appending all the sub data frames to a list and then concatenating the list of sub data frames in a single call.

That sounds like doing something like this:

DF = pd.DataFrame()
lst = [subDF, subDF, subDF, ..., subDF]  # up to 13 times
for subDF in lst:
    DF = pd.concat([DF, subDF])

Aren't they the same thing? Perhaps I'm misunderstanding the suggested workflow. Here's what I tested.

import numpy
import pandas as pd
import timeit

def test1():
    "make all subDF and then concatenate them"
    numpy.random.seed(1)
    subDF = pd.DataFrame(numpy.random.rand(1))
    lst = [subDF, subDF, subDF]
    DF = pd.DataFrame()
    for subDF in lst:
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

def test2():
    "add each subDF to the collecitng DF as you're making the subDF"
    numpy.random.seed(1)
    DF = pd.DataFrame()
    for i in range(3):
        subDF = pd.DataFrame(numpy.random.rand(1))
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

print('test1() takes {0} sec'.format(timeit.timeit(test1, number=1000)))
print('test2() takes {0} sec'.format(timeit.timeit(test2, number=1000)))

Output:

test1() takes 12.732409087137057 sec
test2() takes 15.097430311612698 sec

I would appreciate your suggestions on efficient ways to concatenate multiple large data frames row-wise. Thanks!

asked Jul 07 '16 by sedeh



1 Answer

Create a list with all your data frames:

dfs = []
for i in range(13):
    df = ... # However it is that you create your dataframes   
    dfs.append(df)

Then concatenate them in one swoop:

merged = pd.concat(dfs) # add ignore_index=True if appropriate

This is a lot faster than your code because it creates exactly 14 dataframes (your original 13 plus merged), while your code creates 26 of them (your original 13 plus 13 intermediate merges).
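Putting it together, the whole pattern fits in a few lines. This is a minimal sketch; `make_frame` is a hypothetical stand-in for however you actually build each of your frames (e.g. reading from a file):

```python
import numpy as np
import pandas as pd

def make_frame(i):
    # Hypothetical loader; here it just produces random data
    # with the same shape as each of your sub-frames.
    return pd.DataFrame(np.random.rand(1000, 14))

# Build all the frames first, then concatenate exactly once.
dfs = [make_frame(i) for i in range(13)]
merged = pd.concat(dfs, ignore_index=True)
```

`merged` ends up with 13 × 1000 rows and the original 14 columns, and no intermediate merge results are ever created.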

EDIT:

Here's a variation on your testing code.

import numpy
import pandas as pd
import timeit

def test_gen_time():
    """Create three large dataframes, but don't concatenate them"""
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))

def test_sequential_concat():
    """Create three large dataframes, concatenate them one by one"""
    DF = pd.DataFrame()
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))
        DF = pd.concat([DF, df], ignore_index=True)

def test_batch_concat():
    """Create three large dataframes, concatenate them at the end"""
    dfs = []
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))
        dfs.append(df)
    DF = pd.concat(dfs, ignore_index=True)

print('test_gen_time() takes {0} sec'
          .format(timeit.timeit(test_gen_time, number=200)))
print('test_sequential_concat() takes {0} sec'
          .format(timeit.timeit(test_sequential_concat, number=200)))
print('test_batch_concat() takes {0} sec'
          .format(timeit.timeit(test_batch_concat, number=200)))

Output:

test_gen_time() takes 10.095820872998956 sec
test_sequential_concat() takes 17.144756617000894 sec
test_batch_concat() takes 12.99131180600125 sec

The lion's share of the time goes to generating the dataframes. Subtracting that baseline, batch concatenation takes around 2.9 seconds, while sequential concatenation takes more than 7 seconds.
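The gap widens as the number of frames grows: each sequential concat re-copies everything accumulated so far, so total copying grows roughly quadratically, while the batch version copies each frame only once into the final result. A rough sketch to see this at larger counts (timings are illustrative, not from the source):

```python
import time
import numpy as np
import pandas as pd

def sequential(n, rows=10**5):
    # Re-copies the accumulated frame on every iteration: ~O(n^2) copying.
    DF = pd.DataFrame()
    for _ in range(n):
        DF = pd.concat([DF, pd.DataFrame(np.random.rand(rows))],
                       ignore_index=True)
    return DF

def batch(n, rows=10**5):
    # Copies each frame exactly once into the final result: ~O(n) copying.
    return pd.concat([pd.DataFrame(np.random.rand(rows)) for _ in range(n)],
                     ignore_index=True)

for fn in (sequential, batch):
    t0 = time.perf_counter()
    fn(20)
    print(fn.__name__, time.perf_counter() - t0, 'sec')
```

Both functions produce identical results; only the amount of copying differs, and the difference grows with the number of frames.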

answered Nov 11 '22 by A. Garcia-Raboso