I am running a basic script that loops over a nested dictionary, grabs data from each record, and appends it to a Pandas DataFrame. The data looks something like this:
data = {"SomeCity": {"Date1": {record1, record2, record3, ...}, "Date2": {}, ...}, ...}
In total it has a few million records. The script itself looks like this:
from pandas import DataFrame, Series

cities = ["SomeCity"]
df = DataFrame({}, columns=['Date', 'HouseID', 'Price'])
for city in cities:
    for dateRun in data[city]:
        for record in data[city][dateRun]:
            recSeries = Series([record['Timestamp'], record['Id'], record['Price']], index=['Date', 'HouseID', 'Price'])
            df = df.append(recSeries, ignore_index=True)
This runs painfully slowly, however. Before I look for a way to parallelize it, I want to make sure I'm not missing something obvious that would make it faster as it is, since I'm still quite new to Pandas.
Appending rows to DataFrames is inherently inefficient. Try to create the entire DataFrame with its final size in one go.
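A minimal sketch of that idea, assuming the nested layout from the question (city, then run date, then a collection of records with 'Timestamp', 'Id' and 'Price' keys). The sample values, and the choice of a list of dicts for each date's records, are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data in the shape described in the question.
data = {
    "SomeCity": {
        "Date1": [
            {"Timestamp": "2017-01-01", "Id": 1, "Price": 100000},
            {"Timestamp": "2017-01-01", "Id": 2, "Price": 250000},
        ],
        "Date2": [
            {"Timestamp": "2017-01-02", "Id": 1, "Price": 101000},
        ],
    },
}
cities = ["SomeCity"]

# Collect plain tuples first; growing a Python list is cheap.
rows = [
    (record["Timestamp"], record["Id"], record["Price"])
    for city in cities
    for date_run in data[city]
    for record in data[city][date_run]
]

# Build the DataFrame once, at its final size.
df = pd.DataFrame(rows, columns=["Date", "HouseID", "Price"])
```

The loop structure is the same as in the question; only the point at which pandas gets involved changes.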
In this benchmark, concatenating multiple DataFrames with the pandas.concat function is 50 times faster than using DataFrame.append. With repeated append calls, a new DataFrame is created at each iteration, and the underlying data is copied each time.
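If the data arrives as several DataFrames rather than as single rows, the same principle applies: gather the pieces in a list and call pandas.concat once. A small sketch with made-up chunks (the column names and values are illustrative only):

```python
import pandas as pd

# Hypothetical chunks, e.g. one small DataFrame per run date.
chunks = [
    pd.DataFrame({"HouseID": [1, 2], "Price": [100000, 250000]}),
    pd.DataFrame({"HouseID": [3], "Price": [175000]}),
]

# A single concat call at the end, instead of one append per iteration.
combined = pd.concat(chunks, ignore_index=True)
```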
append was deprecated because: "Series.append and DataFrame.append [are] making an analogy to list.append, but it's a poor analogy since the behavior isn't (and can't be) in place."
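The difference behind that quote can be seen directly. In this small sketch (values are made up), list.append mutates the list and returns None, whereas pandas.concat leaves its inputs untouched and returns a new object:

```python
import pandas as pd

items = [1]
returned = items.append(2)  # mutates `items` in place
# `returned` is None and `items` now holds both values

df = pd.DataFrame({"a": [1]})
extra = pd.DataFrame({"a": [2]})
bigger = pd.concat([df, extra], ignore_index=True)  # builds a new DataFrame
# `df` still has one row; `bigger` is a separate object with two rows
```

Because the result is always a new object, growing a DataFrame one row at a time means the existing rows are copied again on every step.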
I also used the DataFrame's append function inside a loop, and I was perplexed by how slowly it ran.
A useful example for those who are suffering, based on the correct answer on this page.
Python version: 3
Pandas version: 0.20.3
from pandas import DataFrame

# the dictionary to pass to the pandas DataFrame
d = {}
# a counter used as the key for each entry in "d"
i = 0
# example data to loop over and append to a DataFrame
data = [{"foo": "foo_val_1", "bar": "bar_val_1"},
        {"foo": "foo_val_2", "bar": "bar_val_2"}]
# the loop
for entry in data:
    # add a dictionary entry to the final dictionary
    d[i] = {"col_1_title": entry['foo'], "col_2_title": entry['bar']}
    # increment the counter
    i = i + 1
# create the DataFrame using 'from_dict'
# important: set the 'orient' parameter to "index" so the keys become rows
df = DataFrame.from_dict(d, "index")
The "from_dict" function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html
Appending rows to a list is far more efficient than appending them to a DataFrame. Hence you would want to collect the rows in a list first and then convert that list into a DataFrame.