In the pandas documentation, it states:
It is worth noting, however, that concat (and therefore append) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.
frames = [ process_your_file(f) for f in files ]
result = pd.concat(frames)
My current situation is that I will be concatenating a new dataframe to a growing list of data frames over and over. This will result in a horrifying number of concatenations.
I'm worried about performance, and I'm not sure how to make use of list comprehension in this case. My code is as follows.
df = first_data_frame
while verify:
    # download data (new data becomes available through each iteration)
    # then turn the new data into a data frame, called 'temp'
    frames = [df, temp]
    df = pd.concat(frames)
    if condition_met:
        verify = False
I don't think the parts that download data and create the data frame are relevant; my concern is with the constant concatenation.
How do I implement list comprehension in this case?
If you have a loop that can't be put into a list comprehension (like a while loop), you can initialize an empty list at the top, then append to it during the while loop. Example:
frames = []
while verify:
    # download data
    temp = pd.DataFrame(data)  # turn the new data into a frame
    frames.append(temp)
    if condition_met:
        verify = False

df = pd.concat(frames)  # one concatenation at the very end
You can also put the loop in a generator function, and then use a list comprehension, but that might be more complicated than you need.
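For instance, here is a minimal sketch of the generator approach; `download_data` and `condition_met` are hypothetical stand-ins for the download step and stop check in your code:

import pandas as pd

def downloaded_frames():
    # yields one DataFrame per download until the stop condition is met
    while True:
        data = download_data()   # hypothetical: fetch the next batch of data
        yield pd.DataFrame(data)
        if condition_met():      # hypothetical: the loop's stop condition
            break

df = pd.concat([temp for temp in downloaded_frames()])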
Also, if your data comes naturally as a list of dicts or something like that, you may not need to create all the temporary dataframes - just append all of your data into one giant list of dicts, and then convert that to a dataframe in one call at the very end.
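A rough sketch of that idea, again with `download_data` and `condition_met` as hypothetical placeholders, assuming each download returns a list of dicts:

import pandas as pd

records = []                    # one growing list of dicts
verify = True
while verify:
    new_rows = download_data()  # hypothetical: returns a list of dicts
    records.extend(new_rows)
    if condition_met():         # hypothetical stop check
        verify = False

df = pd.DataFrame(records)      # build the DataFrame in a single call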
A list comprehension is very fast and elegant. I also had to concatenate many different dataframes from a list. This is my code:
import os
import pandas as pd

# FileNames is a list with the names of the csv files contained in the 'dataset' path
FileNames = []
for files in os.listdir("dataset"):
    if files.endswith(".csv"):
        FileNames.append(files)

# function that reads a file from the FileNames list and returns it as a DataFrame
def GetFile(fnombre):
    location = 'dataset/' + fnombre
    df = pd.read_csv(location)
    return df

# list comprehension
df = [GetFile(file) for file in FileNames]
dftot = pd.concat(df)
The result is a DataFrame of over one million rows (8 columns), created in 3 seconds on my i3.
If you replace the two "list comprehension" lines above with the following, you will notice a clear deterioration in performance:

dftot = pd.DataFrame()
for file in FileNames:
    df = GetFile(file)
    dftot = pd.concat([dftot, df])
To add an 'if' condition to your code, change the line:

df = [GetFile(file) for file in FileNames]

to, for example:

df = [GetFile(file) for file in FileNames if file == 'A.csv']

This version reads only the 'A.csv' file.