 

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance

Tags:

python

pandas

I got the following warning

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use newframe = frame.copy()

when I tried to append multiple DataFrames like this:

df1 = pd.DataFrame()
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file
    df1 = df1.append(df, ignore_index=True)

where

  df['id'] = file

seems to cause the warning. Can anyone explain how copy() avoids or reduces the fragmentation problem, or suggest other ways to avoid the issue?

Thanks,


I tried to write test code to reproduce the problem, but I don't see the PerformanceWarning with a test dataset (random integers). The same code keeps producing the warning when reading the real dataset, so something in the real dataset seems to trigger the issue.

import pandas as pd
import numpy as np
import os
import glob
rows = 35000
cols = 1900
def gen_data(rows, cols, num_files):
    # Build the list of file paths; only generate the pickles if ./data is missing
    files = [f'./data/{i}.pkl' for i in range(num_files)]
    if not os.path.isdir('./data'):
        os.mkdir('./data')
        for file in files:
            pd.DataFrame(
                np.random.randint(1, 1_000, (rows, cols))
            ).to_pickle(file)
    return files

# Comment out the second line to run the testing dataset; leave it in to run the real dataset
files = gen_data(rows, cols, 10) # testing dataset, runs okay
files = glob.glob('../pickles3/my_data_*.pickle') # real dataset, get performance warning

dfs = []
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file

    dfs.append(df)

dfs = pd.concat(dfs, ignore_index=True)
asked Nov 07 '22 by Chung-Kan Huang


1 Answer

append is not an efficient method for this operation: each call copies the entire accumulated frame, so the cost grows with every iteration. concat is more appropriate in this situation.

Replace

df1 = df1.append(df, ignore_index=True)

with

df1 = pd.concat((df1, df), axis=0, ignore_index=True)
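
Better still, avoid growing the frame inside the loop at all: collect the pieces in a list and call pd.concat once at the end, as the test code in the question already does. A minimal sketch of that pattern (assuming the files are pickles, as in the question):

import pandas as pd

dfs = []
for file in files:
    df = pd.read_pickle(file)  # assuming pickle files, as in the question
    df['id'] = file            # tag each frame with its source file
    dfs.append(df)             # plain list append; no DataFrame data is copied here

# a single concat copies each file's data only once
df1 = pd.concat(dfs, ignore_index=True)

This way each file's data is copied once, instead of re-copying the growing frame on every iteration.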

Details about the differences are in this question: Pandas DataFrame concat vs append
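
On the copy() part of the question: pandas stores a DataFrame's columns in internal blocks, and inserting columns one at a time (which is what df['id'] = file does via frame.insert) can leave the frame split across many small blocks; that is what the warning calls fragmentation. frame.copy() rebuilds the frame with consolidated blocks, which is why the warning suggests newframe = frame.copy(). A rough sketch of the mechanism; note that _mgr.nblocks is internal pandas API and is used here only for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 1_000, (1_000, 5)))
for i in range(150):
    df[f'col_{i}'] = i      # each single-column insert adds another internal block

print(df._mgr.nblocks)      # large block count: the frame is fragmented (internal attribute)

df = df.copy()              # copy() returns a consolidated, de-fragmented frame
print(df._mgr.nblocks)      # small block count again

Whether the warning fires depends on how fragmented the loaded frame already is, which may explain why the random test data stays quiet while the real pickles do not.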

answered Nov 15 '22 by Polkaguy6000