Improve Row Append Performance On Pandas DataFrames

I am running a basic script that loops over a nested dictionary, grabs data from each record, and appends it to a Pandas DataFrame. The data looks something like this:

data = {"SomeCity": {"Date1": {record1, record2, record3, ...}, "Date2": {}, ...}, ...} 

In total it has a few million records. The script itself looks like this:

from pandas import DataFrame, Series

cities = ["SomeCity"]
FredDF = DataFrame({}, columns=['Date', 'HouseID', 'Price'])
for city in cities:
    for dateRun in data[city]:
        for record in data[city][dateRun]:
            recSeries = Series([record['Timestamp'],
                                record['Id'],
                                record['Price']],
                               index=['Date', 'HouseID', 'Price'])
            # slow step: every append builds a new DataFrame and copies all rows
            FredDF = FredDF.append(recSeries, ignore_index=True)

This runs painfully slowly, however. Before I look for a way to parallelize it, I want to make sure I'm not missing something obvious that would make it faster as it is, since I'm still quite new to Pandas.

Brideau asked Jan 13 '15 18:01


People also ask

Is appending to a DataFrame slow?

Appending rows to DataFrames is inherently inefficient. Try to create the entire DataFrame with its final size in one go.
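
A minimal sketch of building the frame at its final size in one go, assuming the number of rows is known before the loop starts. The column names follow the question; the sample records and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample records; in the question these come from the nested dict.
records = [
    {"Timestamp": "2015-01-01", "Id": 1, "Price": 100.0},
    {"Timestamp": "2015-01-02", "Id": 2, "Price": 150.0},
    {"Timestamp": "2015-01-03", "Id": 3, "Price": 125.0},
]

n = len(records)

# Allocate each column once, at its final length.
dates = np.empty(n, dtype=object)
house_ids = np.empty(n, dtype=np.int64)
prices = np.empty(n, dtype=np.float64)

# Fill the preallocated arrays in place; the table is never copied per row.
for i, record in enumerate(records):
    dates[i] = record["Timestamp"]
    house_ids[i] = record["Id"]
    prices[i] = record["Price"]

# Build the DataFrame a single time from the finished columns.
df = pd.DataFrame({"Date": dates, "HouseID": house_ids, "Price": prices})
```

If the row count is not known in advance, collecting rows in a plain Python list and converting once at the end is usually the simpler option.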

Which is faster: pandas concat or append?

In this benchmark, concatenating multiple DataFrames with the pandas.concat function is 50 times faster than using DataFrame.append. With repeated append calls, a new DataFrame is created at each iteration, and the underlying data is copied each time.
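
As a rough illustration of the concat approach, the sketch below gathers several small frames in a list and joins them with a single pandas.concat call. The chunk contents are hypothetical, and no timing is measured here.

```python
import pandas as pd

# Hypothetical per-date chunks; real code would build one frame per batch.
chunks = [
    pd.DataFrame({"HouseID": [1, 2], "Price": [100.0, 150.0]}),
    pd.DataFrame({"HouseID": [3], "Price": [125.0]}),
    pd.DataFrame({"HouseID": [4, 5], "Price": [90.0, 110.0]}),
]

# One concat call copies the data once, instead of once per appended chunk.
combined = pd.concat(chunks, ignore_index=True)
```

Passing ignore_index=True renumbers the rows from 0, which mirrors the ignore_index=True used with append in the question.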

Why is append deprecated?

append was deprecated because: "Series.append and DataFrame.append [are] making an analogy to list.append, but it's a poor analogy since the behavior isn't (and can't be) in place."


2 Answers

I also used the DataFrame's append function inside a loop and was perplexed by how slowly it ran.

Here is a useful example for those who are suffering from the same problem, based on the correct answer on this page.

Python version: 3

Pandas version: 0.20.3

from pandas import DataFrame

# the dictionary to pass to the pandas DataFrame
d = {}

# a counter used as the key for each entry in "d"
i = 0

# example data to loop over and add to a DataFrame
data = [{"foo": "foo_val_1", "bar": "bar_val_1"},
        {"foo": "foo_val_2", "bar": "bar_val_2"}]

# the loop
for entry in data:

    # add a dictionary entry to the final dictionary
    d[i] = {"col_1_title": entry['foo'], "col_2_title": entry['bar']}

    # increment the counter
    i = i + 1

# create the DataFrame using 'from_dict'
# important: set the 'orient' parameter to "index" so the keys become rows
df = DataFrame.from_dict(d, orient="index")

The "from_dict" function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html

P-S answered Oct 13 '22 06:10


Appending rows to a list is far more efficient than appending them to a DataFrame. Hence you would want to:

  1. Append the rows to a list.
  2. Convert the list into a DataFrame.
  3. Set the index as required.
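
The three steps above can be sketched as follows, assuming the question's nested layout of city, then date run, then records, with each record being a dict. The sample data and the choice of HouseID as the index are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Hypothetical stand-in for the question's nested dict.
data = {
    "SomeCity": {
        "Date1": [
            {"Timestamp": "2015-01-01", "Id": 1, "Price": 100.0},
            {"Timestamp": "2015-01-01", "Id": 2, "Price": 150.0},
        ],
        "Date2": [
            {"Timestamp": "2015-01-02", "Id": 3, "Price": 125.0},
        ],
    }
}

# 1. Append each row to a plain Python list (cheap, amortized constant time).
rows = []
for city in data:
    for date_run in data[city]:
        for record in data[city][date_run]:
            rows.append((record["Timestamp"], record["Id"], record["Price"]))

# 2. Convert the list into a DataFrame in a single call.
df = pd.DataFrame(rows, columns=["Date", "HouseID", "Price"])

# 3. Set the index as required (HouseID is an assumed choice here).
df = df.set_index("HouseID")
```

The DataFrame is constructed exactly once, so the cost of copying existing rows on every iteration disappears.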

Mahidhar Surapaneni answered Oct 13 '22 05:10