
Python Pandas - Using list comprehension to concat data frames

In the pandas documentation, it states:

It is worth noting however, that concat (and therefore append) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

frames = [ process_your_file(f) for f in files ]

result = pd.concat(frames)

My current situation is that I will be concatenating a new dataframe to a growing list of data frames over and over. This will result in a horrifying number of concatenations.

I'm worried about performance, and I'm not sure how to make use of list comprehension in this case. My code is as follows.

df = first_data_frame
while verify:
    # download data (new data becomes available through each iteration)
    # then turn the new data into a data frame, called 'temp'
    frames = [df, temp]
    df = pd.concat(frames)
    if condition_met:
        verify = False

I don't think the parts that download data and create the data frame are relevant; my concern is with the constant concatenation.

How do I implement list comprehension in this case?

asked Oct 08 '15 by TheRealFakeNews


2 Answers

If you have a loop that can't be put into a list comprehension (like a while loop), you can initialize an empty list at the top, then append to it during the while loop. Example:

frames = []
while verify:
    # download data
    # temp = pd.DataFrame(data)
    frames.append(temp)
    if condition_met:
        verify = False

df = pd.concat(frames)

You can also put the loop in a generator function, and then use a list comprehension, but that might be more complicated than you need.
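A minimal sketch of that generator approach (the download step and the stopping condition below are stand-ins, not real code):

```python
import pandas as pd

def fetch_frames():
    """Yield one DataFrame per download until the stopping condition is met."""
    verify = True
    while verify:
        data = {"a": [1], "b": [2]}   # stand-in for the real download step
        yield pd.DataFrame(data)
        verify = False                # stand-in for the real stopping condition

# The generator keeps the while-loop logic, yet the list is still built in one expression.
frames = [frame for frame in fetch_frames()]
df = pd.concat(frames, ignore_index=True)
```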

Also, if your data comes naturally as a list of dicts (or similar), you may not need to create all the temporary dataframes at all: just append each batch of data to one big list of dicts, then convert that to a dataframe in a single call at the very end.
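For example, sketching that idea with made-up batches of records:

```python
import pandas as pd

rows = []
batches = ([{"a": 1, "b": 2}], [{"a": 3, "b": 4}])  # stand-ins for downloaded batches
for batch in batches:
    rows.extend(batch)        # accumulate plain dicts; no intermediate DataFrames

df = pd.DataFrame(rows)       # build the DataFrame once, at the very end
```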

answered Sep 20 '22 by user2034412


List comprehension is very fast and elegant. I also had to concatenate many different dataframes from a list of files. This is my code:

import os
import pandas as pd

# FileNames is a list with the names of the csv files contained in the 'dataset' path

FileNames = []
for files in os.listdir("dataset"):
    if files.endswith(".csv"):
        FileNames.append(files)

# function that reads a file from the FileNames list and turns it into a DataFrame

def GetFile(fnombre):
    location = 'dataset/' + fnombre
    df = pd.read_csv(location)
    return df

# list comprehension
df = [GetFile(file) for file in FileNames]
dftot = pd.concat(df)

The result is a DataFrame of over one million rows (8 columns), created in 3 seconds on my i3.

If you replace the two "list comprehension" lines with these, you will notice the performance deteriorate:

dftot = pd.DataFrame()
for file in FileNames:
    df = GetFile(file)
    dftot = pd.concat([dftot, df])

To add an 'if' condition to the comprehension, change the line:

df = [GetFile(file) for file in FileNames]

to something like:

df = [GetFile(file) for file in FileNames if file == 'A.csv']

This version reads the 'A.csv' file only.

answered Sep 22 '22 by Giuseppe Bellisano