Improve Row Append Performance On Pandas DataFrames

I am running a basic script that loops over a nested dictionary, grabs data from each record, and appends it to a Pandas DataFrame. The data looks something like this:

data = {"SomeCity": {"Date1": {record1, record2, record3, ...}, "Date2": {}, ...}, ...}

In total it has a few million records. The script itself looks like this:

from pandas import DataFrame, Series

cities = ["SomeCity"]
FredDF = DataFrame({}, columns=['Date', 'HouseID', 'Price'])
for city in cities:
    for dateRun in data[city]:
        for record in data[city][dateRun]:
            recSeries = Series([record['Timestamp'],
                                record['Id'],
                                record['Price']],
                               index=['Date', 'HouseID', 'Price'])
            FredDF = FredDF.append(recSeries, ignore_index=True)

This runs painfully slowly, however. Before I look for a way to parallelize it, I want to make sure I'm not missing something obvious that would make it run faster as is, since I'm still quite new to Pandas.

706

asked Jan 13 '15 18:01

Brideau

2 Answers

I also used the DataFrame's append function inside a loop, and I was perplexed by how slowly it ran.

Here is a useful example for those who are suffering from the same problem, based on the correct answer on this page.

Python version: 3

Pandas version: 0.20.3

from pandas import DataFrame

# the dictionary to pass to the pandas DataFrame
d = {}

# a counter used as the key for each entry added to "d"
i = 0

# example data to loop over and append to a DataFrame
data = [{"foo": "foo_val_1", "bar": "bar_val_1"},
        {"foo": "foo_val_2", "bar": "bar_val_2"}]

# the loop
for entry in data:

    # add a dictionary entry to the final dictionary
    d[i] = {"col_1_title": entry['foo'], "col_2_title": entry['bar']}

    # increment the counter
    i = i + 1

# create the DataFrame using 'from_dict'
# important: set the 'orient' parameter to "index" so the keys become rows
df = DataFrame.from_dict(d, "index")

The "from_dict" function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html

124

answered Oct 13 '22 06:10

P-S

Appending rows to a list is far more efficient than appending them to a DataFrame. Hence you would want to:

1. Append the rows to a list.
2. Convert the list into a DataFrame.
3. Set the index as required.
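
A minimal sketch of the list approach, applied to data shaped like the question's nested dictionary. The sample records, the variable names (`rows`, `df_indexed`), and the choice of `HouseID` as the index are assumptions made for illustration; only the `Timestamp`, `Id`, and `Price` keys and the column names come from the question.

```python
import pandas as pd

# Hypothetical sample shaped like the question's nested dictionary:
# city -> run date -> list of records.
data = {
    "SomeCity": {
        "Date1": [
            {"Timestamp": "2015-01-01", "Id": 1, "Price": 100000},
            {"Timestamp": "2015-01-01", "Id": 2, "Price": 250000},
        ],
        "Date2": [
            {"Timestamp": "2015-01-02", "Id": 1, "Price": 101000},
        ],
    },
}

# 1. Append the rows to a plain Python list.
rows = []
for city in data:
    for date_run in data[city]:
        for record in data[city][date_run]:
            rows.append(
                {
                    "Date": record["Timestamp"],
                    "HouseID": record["Id"],
                    "Price": record["Price"],
                }
            )

# 2. Convert the list into a DataFrame once, after the loop.
df = pd.DataFrame(rows, columns=["Date", "HouseID", "Price"])

# 3. Set the index as required (HouseID is an arbitrary choice here).
df_indexed = df.set_index("HouseID")
```

The DataFrame is built a single time at the end, so the existing rows are never copied inside the loop. That repeated copying is the likely reason the per-row append in the question is so slow.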

answered Oct 13 '22 05:10

Mahidhar Surapaneni

Tags: python, pandas, numpy, python-2.7
