I have 14 data frames, each with 14 columns and more than 250,000 rows. The data frames have identical column headers, and I would like to merge them row-wise. I attempted to concatenate the data frames into a 'growing' DataFrame, and it's taking several hours.
Essentially, I was doing something like this:
DF = pd.DataFrame()
for i in range(13):
    DF = pd.concat([DF, subDF])
The Stack Overflow answer here suggests appending all the sub data frames to a list and then concatenating that list of sub data frames.
That sounds like doing something like this:
DF = pd.DataFrame()
lst = [subDF, subDF, subDF, ..., subDF]  # up to 13 times
for subDF in lst:
    DF = pd.concat([DF, subDF])
Aren't they the same thing? Perhaps I'm misunderstanding the suggested workflow. Here's what I tested.
import numpy
import pandas as pd
import timeit
def test1():
    """Make all the subDFs first, then concatenate them."""
    numpy.random.seed(1)
    subDF = pd.DataFrame(numpy.random.rand(1))
    lst = [subDF, subDF, subDF]
    DF = pd.DataFrame()
    for subDF in lst:
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)

def test2():
    """Add each subDF to the collecting DF as you're making the subDFs."""
    numpy.random.seed(1)
    DF = pd.DataFrame()
    for i in range(3):
        subDF = pd.DataFrame(numpy.random.rand(1))
        DF = pd.concat([DF, subDF], axis=0, ignore_index=True)
print('test1() takes {0} sec'.format(timeit.timeit(test1, number=1000)))
print('test2() takes {0} sec'.format(timeit.timeit(test2, number=1000)))
Output:
test1() takes 12.732409087137057 sec
test2() takes 15.097430311612698 sec
I would appreciate your suggestions on efficient ways to concatenate multiple large data frames row-wise. Thanks!
Create a list with all your data frames:
dfs = []
for i in range(13):
    df = ...  # however it is that you create your dataframes
    dfs.append(df)
Then concatenate them in one swoop:
merged = pd.concat(dfs) # add ignore_index=True if appropriate
This is a lot faster than your code because it creates exactly 14 dataframes (your original 13 plus merged), while your code creates 26 of them (your original 13 plus 13 intermediate merges).
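Applied to your case, the whole workflow collapses to a single pd.concat call. Here is a minimal sketch, assuming the frames come from CSV files; read_csv and the file names are placeholders for however you actually build your 14 frames:
import pandas as pd

# Hypothetical sources: substitute however you actually create your frames.
paths = ['part_{0}.csv'.format(i) for i in range(14)]

# Build all the frames first...
dfs = [pd.read_csv(p) for p in paths]

# ...then concatenate them row-wise in one call.
merged = pd.concat(dfs, ignore_index=True)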
EDIT:
Here's a variation on your testing code.
import numpy
import pandas as pd
import timeit
def test_gen_time():
    """Create three large dataframes, but don't concatenate them."""
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))

def test_sequential_concat():
    """Create three large dataframes, concatenate them one by one."""
    DF = pd.DataFrame()
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))
        DF = pd.concat([DF, df], ignore_index=True)

def test_batch_concat():
    """Create three large dataframes, concatenate them at the end."""
    dfs = []
    for i in range(3):
        df = pd.DataFrame(numpy.random.rand(10**6))
        dfs.append(df)
    DF = pd.concat(dfs, ignore_index=True)

print('test_gen_time() takes {0} sec'
      .format(timeit.timeit(test_gen_time, number=200)))
print('test_sequential_concat() takes {0} sec'
      .format(timeit.timeit(test_sequential_concat, number=200)))
print('test_batch_concat() takes {0} sec'
      .format(timeit.timeit(test_batch_concat, number=200)))
Output:
test_gen_time() takes 10.095820872998956 sec
test_sequential_concat() takes 17.144756617000894 sec
test_batch_concat() takes 12.99131180600125 sec
The lion's share of the time goes to generating the dataframes. Subtracting the generation time, batch concatenation takes around 2.9 seconds, while sequential concatenation takes more than 7 seconds.
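If you want to measure the concatenation cost directly, one variation (a sketch along the same lines, not part of the benchmark above) is to build the frames once up front, so that timeit measures only the two concatenation styles:
import numpy
import pandas as pd
import timeit

# Build the frames once, outside the timed functions.
frames = [pd.DataFrame(numpy.random.rand(10**6)) for _ in range(3)]

def sequential_concat():
    """Concatenate the pre-built frames one by one."""
    DF = pd.DataFrame()
    for df in frames:
        DF = pd.concat([DF, df], ignore_index=True)

def batch_concat():
    """Concatenate the pre-built frames in a single call."""
    DF = pd.concat(frames, ignore_index=True)

print('sequential_concat() takes {0} sec'
      .format(timeit.timeit(sequential_concat, number=200)))
print('batch_concat() takes {0} sec'
      .format(timeit.timeit(batch_concat, number=200)))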