I got the following warning:

```
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use `newframe = frame.copy()`
```
when I tried to append multiple DataFrames like this:

```python
df1 = pd.DataFrame()
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file
    df1 = df1.append(df, ignore_index=True)
```
where the line `df['id'] = file` seems to cause the warning. I wonder if anyone can explain how `copy()` avoids or reduces the fragmentation problem, or suggest other solutions that avoid the issue.
Thanks,
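For background on what the warning is about: each column assignment on an existing frame can add a separate internal block, and pandas warns once a frame accumulates many of them (roughly 100 on recent versions; the exact threshold is an implementation detail). `copy()` materializes a new frame with those blocks consolidated. A minimal sketch that reproduces the warning synthetically:

```python
import numpy as np
import pandas as pd

# Inserting columns one at a time leaves one internal block per column;
# past roughly 100 blocks, recent pandas versions emit the
# PerformanceWarning quoted above.
df = pd.DataFrame(index=range(1_000))
for i in range(150):
    df[f'col{i}'] = np.random.randint(1, 1_000, size=1_000)

# copy() returns a new, consolidated frame: the many one-column blocks
# are merged into contiguous arrays, so later column assignments no
# longer trigger the warning.
df = df.copy()
```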
I tried to write a test script to reproduce the problem, but I don't see the PerformanceWarning with a synthetic dataset (random integers). The same code keeps producing the warning when it reads the real dataset, so it looks like something in the real dataset triggers the issue.
```python
import pandas as pd
import numpy as np
import os
import glob

rows = 35000
cols = 1900

def gen_data(rows, cols, num_files):
    if not os.path.isdir('./data'):
        os.mkdir('./data')
    files = []
    for i in range(num_files):
        file = f'./data/{i}.pkl'
        pd.DataFrame(
            np.random.randint(1, 1_000, (rows, cols))
        ).to_pickle(file)
        files.append(file)
    return files

# Comment out the second line to run the testing dataset;
# comment out the first line to run the real dataset.
files = gen_data(rows, cols, 10)                   # testing dataset, runs okay
files = glob.glob('../pickles3/my_data_*.pickle')  # real dataset, gets the performance warning

dfs = []
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file
    dfs.append(df)
dfs = pd.concat(dfs, ignore_index=True)
```
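One way to narrow down the difference between the two datasets is to compare how many internal blocks each pickle deserializes into. The sketch below leans on `df._mgr.nblocks`, which is pandas-internal and can change between versions, so treat it purely as a diagnostic probe:

```python
import pandas as pd

for file in files[:3]:
    df = pd.read_pickle(file)
    # A consolidated frame has only a handful of blocks (roughly one per
    # dtype); a fragmented one has hundreds. _mgr is not public API.
    print(file, df.shape, 'blocks:', df._mgr.nblocks)
```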
`append` is not an efficient method for this operation; `concat` is more appropriate in this situation.
Replace

```python
df1 = df1.append(df, ignore_index=True)
```

with

```python
df1 = pd.concat((df1, df), axis=0, ignore_index=True)
```

(Note the assignment back to `df1`: `concat` returns a new frame rather than modifying one in place, and `ignore_index=True` preserves the behavior of the original `append` call.)
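Even with `concat`, growing a DataFrame inside the loop still re-copies all previously accumulated rows on every iteration. Where the loop can be restructured, the usual pattern is to collect the pieces in a plain Python list and concatenate once at the end, as the test script in the question already does; a minimal sketch:

```python
import pandas as pd

dfs = []
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file
    dfs.append(df)                       # cheap: list append, no row data copied
df1 = pd.concat(dfs, ignore_index=True)  # single consolidation at the end
```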
Details about the differences are in this question: Pandas DataFrame concat vs append