I'm using the following function to concatenate a large number of CSV files:
import pandas as pd

def concatenate():
    files = sort()  # input is an array of filenames
    merged = pd.DataFrame()
    for file in files:
        print("concatenating " + file)
        if file.endswith('FulltimeSimpleOpt.csv'):  # only consider those filenames
            filenamearray = file.split("_")
            f = pd.read_csv(file, index_col=0)
            f.loc[:, 'Vehicle'] = filenamearray[0].replace("veh", "")
            f.loc[:, 'Year'] = filenamearray[1].replace("year", "")
            if "timelimit" in file:
                f.loc[:, 'Timelimit'] = "1"
            else:
                f.loc[:, 'Timelimit'] = "0"
            merged = pd.concat([merged, f], axis=0)
    merged.to_csv('merged.csv')
The problem with this function is that it doesn't handle large numbers of files (30,000) well. I tried a sample of 100 files, which finishes properly. With all 30,000 files, however, the script slows down and eventually crashes.
How can I handle large numbers of files better in Python Pandas?
Make a list of DataFrames first and then concatenate them in one go:

import pandas as pd

def concatenate():
    files = sort()  # input is an array of filenames
    df_list = []
    # merged = pd.DataFrame()  # no longer needed
    for file in files:
        print("concatenating " + file)
        if file.endswith('FulltimeSimpleOpt.csv'):  # only consider those filenames
            filenamearray = file.split("_")
            f = pd.read_csv(file, index_col=0)
            f.loc[:, 'Vehicle'] = filenamearray[0].replace("veh", "")
            f.loc[:, 'Year'] = filenamearray[1].replace("year", "")
            if "timelimit" in file:
                f.loc[:, 'Timelimit'] = "1"
            else:
                f.loc[:, 'Timelimit'] = "0"
            df_list.append(f)  # collect each DataFrame instead of concatenating immediately
    merged = pd.concat(df_list, axis=0)  # single concatenation at the end
    merged.to_csv('merged.csv')
What you were doing is incrementally growing your DataFrame by repeatedly concatenating inside the loop; each pd.concat copies all the data accumulated so far, so the total cost grows roughly quadratically with the number of files. It is much more efficient to build a list of DataFrames and concatenate all of them in one go. If memory is still a concern with 30,000 frames, a lighter-weight variant is sketched below.
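A minimal sketch (not part of the original answer, and assuming the same sort() helper and filename layout as above): instead of keeping every DataFrame in memory, append each processed frame directly to the output CSV, writing the header only once.

import os
import pandas as pd

def concatenate_streaming(out_path='merged.csv'):
    files = sort()  # hypothetical helper from the question: array of filenames
    if os.path.exists(out_path):
        os.remove(out_path)  # start from a clean output file
    for file in files:
        if not file.endswith('FulltimeSimpleOpt.csv'):
            continue
        filenamearray = file.split("_")
        f = pd.read_csv(file, index_col=0)
        f.loc[:, 'Vehicle'] = filenamearray[0].replace("veh", "")
        f.loc[:, 'Year'] = filenamearray[1].replace("year", "")
        f.loc[:, 'Timelimit'] = "1" if "timelimit" in file else "0"
        # write the header only for the first chunk, then append without it
        f.to_csv(out_path, mode='a', header=not os.path.exists(out_path))
    return out_path

This keeps memory usage roughly constant per file at the cost of not having the merged DataFrame available in memory afterwards.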