
Pandas: Why should appending to a dataframe of floats and ints be slower than if it's full of NaNs

I am taking data from a file that receives Interactive Brokers 5-second OHLCVT bars via Sierra Chart.

Following advice in earlier posts, rather than appending each new row to the dataframe, I construct a dataframe from the historical file and append 5000 "blank" records to it with correct timestamps. I then write each new row over a blank row, filling in any rows whose timestamps are missing and updating pointers.

This works well. Here are the current classes and functions. My initial version created 5000 lines of NaNs (OHLCVxyz). I thought it would be tidier to start with the final data types, so I converted the "blank" records to zeros, with OHLC as floats and Vxyz as ints, using:

    # v, x, y, z hold the labels of the columns being cast to int
    dg.iloc[0:5000] = 0.0
    dg[[v, x, y, z]] = dg[[v, x, y, z]].astype('int')
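For context, a minimal sketch of this pre-allocation idea is below; the function name, column labels and 5-second frequency are my assumptions, not the actual SierraFrame code:

    import numpy as np
    import pandas as pd

    def extend_with_blank_rows(df, n_new=5000, freq='5S'):
        # Build n_new timestamps continuing on from the last index entry
        new_index = pd.date_range(start=df.index[-1], periods=n_new + 1, freq=freq)[1:]
        blank = pd.DataFrame(np.nan, index=new_index, columns=df.columns)
        return pd.concat([df, blank])

    # Example usage with hypothetical OHLCVxyz columns
    dg = pd.DataFrame(np.random.rand(3, 8),
                      index=pd.date_range('2013-06-17 09:30:00', periods=3, freq='5S'),
                      columns=list('ohlcvxyz'))
    dg = extend_with_blank_rows(dg)
    # Each incoming bar is then written over the next blank row, e.g.
    # dg.iloc[end] = [o, h, l, c, v, x, y, z]; end += 1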

This only occurs once per additional 5000 lines (once a day for HSI). What surprised me was the impact on the read/write loops. They went from 0.8ms to 3.4ms per row. The only change was from NaNs to zeros.

This picture shows an initial run with a zero-filled frame (see timestats 0.0038), then a run with a NaN-filled frame (timestats 0.0008).

Can anyone provide insight on why it might add so much time to write to fields of [0.0, 0.0, 0.0, 0.0, 0, 0, 0, 0] instead of [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN] ?

Any thoughts on code improvements are also welcome. :)

Thanks

EDIT +17 hours

Following the questions from @BrenBarn I built a simpler model that could be run by anyone without data. In doing so I eliminated the question of whether the NaNs impact it. In this version I was able to write 0.0s to both versions and the difference was the same:

  • the array that has 8 columns of floats is added to 10 times faster than the array that has 4 columns of floats and 4 of int64s.
  • in each case the row being added was [1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8]
  • the add is done 10000 times with self.df.iloc[self.end] = datarow, incrementing end after each write.

So, unless I'm mistaken (always possible) it seems that adding to a dataframe with 4 columns of floats and 4 of ints takes 10x as long. Is this an issue for pandas or just what one should expect?

Here's the test code and here is the output picture
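Since the linked test code isn't reproduced here, a stripped-down, stand-alone version of the same experiment might look roughly like this (all names are placeholders):

    import time
    import numpy as np
    import pandas as pd

    def time_row_writes(df, n=10000):
        # Overwrite n pre-allocated rows and return the average time per write
        row = [1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8]
        start = time.time()
        for i in range(n):
            df.iloc[i] = row
        return (time.time() - start) / n

    cols = list('ohlcvxyz')
    base = pd.DataFrame(np.zeros((20000, 8)), columns=cols)

    all_floats = base.copy()          # 8 float64 columns
    mixed = base.copy()               # 4 float64 + 4 int64 columns
    mixed[['v', 'x', 'y', 'z']] = mixed[['v', 'x', 'y', 'z']].astype('int64')

    print('all float64 :', time_row_writes(all_floats))
    print('float + int :', time_row_writes(mixed))   # markedly slower per row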

I think that having an array of 350,000 rows of 8 columns before you add to it makes a significant difference. My initial tests adding to 10 rows showed no impact - I must go back and retest them.

EDIT +10 minutes

No, I went back and created the initial array with only 10 rows, and the impact on the add loops didn't change, so it's not the size of the original array/dataframe. It's probable that in my earlier test I thought I'd converted the columns to ints but hadn't - checking this proved that the command I thought would do the conversion didn't.

    # Recreate the test with a 10-row starting frame
    da = SierraFrame(range(10), np.zeros((10, 8)))
    da.extend_frame1()
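A quick sanity check for that is to print the dtypes after the conversion (assuming, as the self.df.iloc[self.end] line above suggests, that SierraFrame keeps its dataframe in self.df):

    # Columns that were actually cast report int64; any the earlier
    # command silently missed will still report float64.
    print(da.df.dtypes)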

EDIT and Possible Answer +35 minutes

In case this question isn't answered in more detail:

At this point, my hypothesis is that the underlying functionality for adding [1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8] to a spare line in the dataframe is different if the df is all one type than if it has columns of both floats and ints. I just tested it with all int64s: the average add was 0.41ms, vs 0.37ms for all floats and 2.8ms for a mixed dataframe. Int8s took 0.39ms. I guess that the mix affects pandas' ability to optimize its action, so if efficiency is very important then a dataframe with all columns of the same type (probably float64) is the best bet.

Tests conducted on Linux x64 with Python 3.3.1

asked Jun 17 '13 by John 9631


1 Answer

As described in this blog post by the main author of pandas, a pandas DataFrame is internally made up of "blocks". A block is a group of columns all having the same datatype. Each block is stored as a numpy array of its block type. So if you have five int columns and then five float columns, there will be an int block and a float block.
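To make that concrete, here is a small illustration. The BlockManager is a private API and its attribute name differs between pandas versions (_data in older releases, _mgr in newer ones), so treat this as a peek rather than a supported interface:

    import pandas as pd

    df = pd.DataFrame({'o': [1.0, 2.0], 'h': [3.0, 4.0],
                       'v': [5, 6], 'x': [7, 8]})

    print(df.dtypes)
    # o    float64
    # h    float64
    # v      int64
    # x      int64

    # Internal block layout: one float64 block (o, h) and one int64 block (v, x)
    mgr = df._mgr if hasattr(df, '_mgr') else df._data
    print(mgr.blocks)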

Appending to a multi-type array requires appending to each of the underlying numpy arrays. Appending to numpy arrays is slow, because it requires creating a whole new numpy array. So it makes sense that appending to a multi-type DataFrame is slow: if all the columns are one type, it only has to create one new numpy array, but if they're different types, it has to create multiple new numpy arrays.

It is true that keeping the data all the same type will speed this up. However, I would say the main conclusion is not "if efficiency is important, keep all your columns the same type". The conclusion is if efficiency is important, do not try to append to your arrays/DataFrames.

This is just how numpy works. The slowest part of working with numpy arrays is creating them in the first place. They have a fixed size, and when you "append" to one, you really are just creating an entirely new one with the new size, which is slow. If you absolutely must append to them, you can try stuff like messing with types to ease the pain somewhat. But ultimately you just have to accept that any time you try to append to a DataFrame (or a numpy array in general), you will likely suffer a substantial performance hit.
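One common way to follow that advice (a sketch, not something from the original code) is to accumulate incoming rows in a plain Python list and build the DataFrame, including any dtype casts, once at the end:

    import pandas as pd

    rows = []
    for _ in range(10000):                  # stand-in for incoming 5-second bars
        rows.append([1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8])

    # Construct the DataFrame once, then cast the integer columns in one pass
    df = pd.DataFrame(rows, columns=list('ohlcvxyz'))
    df[['v', 'x', 'y', 'z']] = df[['v', 'x', 'y', 'z']].astype('int64')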

answered by BrenBarn