I have 14 data frames, each with 14 columns and more than 250,000 rows. The data frames have identical column headers, and I would like to merge them row-wise. I attempted to concatenate the data frames into a 'growing' DataFrame, and it's taking several hours.
Essentially, I was doing something like this:
DF = pd.DataFrame()
for i in range(13):
    DF = pd.concat([DF, subDF])
The Stack Overflow answer here suggests appending all the sub data frames to a list and then concatenating that list of sub data frames.
That sounds like doing something like this:
DF = pd.DataFrame()
lst = [subDF, subDF, subDF, ..., subDF]  # up to 13 times
for subDF in lst:
    DF = pd.concat([DF, subDF])
Aren't they the same thing? Perhaps I'm misunderstanding the suggested workflow. Here's what I tested.
import numpy
import pandas as pd
import timeit
def test1():
    """Make all the subDFs first, then concatenate them."""
    numpy.random.seed(1)
    subDF = pd.DataFrame(numpy.random.rand(1))
    lst = [subDF, subDF, subDF]
    DF = pd.DataFrame()
    for subDF in lst:
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

def test2():
    """Add each subDF to the collecting DF as you're making the subDFs."""
    numpy.random.seed(1)
    DF = pd.DataFrame()
    for i in range(3):
        subDF = pd.DataFrame(numpy.random.rand(1))
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)
print('test1() takes {0} sec'.format(timeit.timeit(test1, number=1000)))
print('test2() takes {0} sec'.format(timeit.timeit(test2, number=1000)))
Output:
test1() takes 12.732409087137057 sec
test2() takes 15.097430311612698 sec
I would appreciate your suggestions on efficient ways to concatenate multiple large data frames row-wise. Thanks!
Create a list with all your data frames:
dfs = []
for i in range(13):
    df = ...  # however it is that you create your dataframes
    dfs.append(df)
Then concatenate them in one swoop:
merged = pd.concat(dfs) # add ignore_index=True if appropriate
This is a lot faster than your code because it creates exactly 14 dataframes (your original 13 plus merged), while your code creates 26 of them (your original 13 plus 13 intermediate merges).
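Applied to your case, the whole workflow collapses to a single pd.concat call. Here is a minimal sketch, assuming the frames come from CSV files; read_csv and the file names are placeholders for however you actually build your 14 frames:
import pandas as pd

# Hypothetical sources: substitute however you actually create your frames.
paths = ['part_{0}.csv'.format(i) for i in range(14)]

# Build all the frames first...
dfs = [pd.read_csv(p) for p in paths]

# ...then concatenate them row-wise in one call.
merged = pd.concat(dfs, ignore_index=True)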
EDIT:
Here's a variation on your testing code.
import numpy
import pandas as pd
import timeit
def test_gen_time():
    """Create three large dataframes, but don't concatenate them."""
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))

def test_sequential_concat():
    """Create three large dataframes, concatenate them one by one."""
    DF = pd.DataFrame()
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))
        DF = pd.concat([DF, df], ignore_index=True)

def test_batch_concat():
    """Create three large dataframes, concatenate them at the end."""
    dfs = []
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))
        dfs.append(df)
    DF = pd.concat(dfs, ignore_index=True)

print('test_gen_time() takes {0} sec'
      .format(timeit.timeit(test_gen_time, number=200)))
print('test_sequential_concat() takes {0} sec'
      .format(timeit.timeit(test_sequential_concat, number=200)))
print('test_batch_concat() takes {0} sec'
      .format(timeit.timeit(test_batch_concat, number=200)))
Output:
test_gen_time() takes 10.095820872998956 sec
test_sequential_concat() takes 17.144756617000894 sec
test_batch_concat() takes 12.99131180600125 sec
The lion's share of the time goes to generating the dataframes. Subtracting the generation time, batch concatenation takes around 2.9 seconds, while sequential concatenation takes more than 7 seconds.
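If you want to measure the concatenation cost directly, one variation (a sketch along the same lines, not part of the benchmark above) is to build the frames once up front, so that timeit measures only the two concatenation styles:
import numpy
import pandas as pd
import timeit

# Build the frames once, outside the timed functions.
frames = [pd.DataFrame(numpy.random.rand(10**6)) for _ in range(3)]

def sequential_concat():
    """Concatenate the pre-built frames one by one."""
    DF = pd.DataFrame()
    for df in frames:
        DF = pd.concat([DF, df], ignore_index=True)

def batch_concat():
    """Concatenate the pre-built frames in a single call."""
    DF = pd.concat(frames, ignore_index=True)

print('sequential_concat() takes {0} sec'
      .format(timeit.timeit(sequential_concat, number=200)))
print('batch_concat() takes {0} sec'
      .format(timeit.timeit(batch_concat, number=200)))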