I am reading data from a file that receives 5-second OHLCVT bars from Interactive Brokers via Sierra Chart.
Following advice in earlier posts, rather than appending each new row to the dataframe, I build a dataframe from the historical file and append 5000 "blank" records to it with the correct timestamps. I then write each new row over a blank row, filling in any rows whose timestamps are missing and updating pointers.
This works well. Here are the current classes and functions. My initial version created 5000 rows of NaNs (OHLCVxyz). I thought it would be tidier to start with the final data types, so I converted the "blank" records to zeros, with OHLC as floats and Vxyz as ints, using:
dg.iloc[0:5000] = 0.0
dg[['V', 'x', 'y', 'z']] = dg[['V', 'x', 'y', 'z']].astype('int')
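In outline, the pattern looks roughly like this (a simplified sketch, not my actual SierraFrame class; the column names, dummy data and 5000-row block size are just illustrative):

import numpy as np
import pandas as pd

# Historical 5-second bars, indexed by timestamp (dummy data).
hist_index = pd.date_range('2013-01-01 09:30:00', periods=1000, freq='5S')
dg = pd.DataFrame(np.random.rand(1000, 8),
                  index=hist_index,
                  columns=['O', 'H', 'L', 'C', 'V', 'x', 'y', 'z'])

# Append 5000 "blank" rows carrying the correct future timestamps.
blank_index = pd.date_range(hist_index[-1] + pd.Timedelta(seconds=5),
                            periods=5000, freq='5S')
blank = pd.DataFrame(np.zeros((5000, 8)), index=blank_index, columns=dg.columns)
dg = pd.concat([dg, blank])

# Convert the volume/tick columns to ints -- this is the step that splits the
# frame into a float64 block and an int64 block.
dg[['V', 'x', 'y', 'z']] = dg[['V', 'x', 'y', 'z']].astype('int64')

# Each incoming bar is then written over the next blank row in place.
next_row = 1000
dg.iloc[next_row] = [418.5, 419.0, 418.0, 418.5, 120, 3, 2, 1]
next_row += 1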
This conversion only happens once per additional 5000 rows (once a day for HSI). What surprised me was the impact on the read/write loops: they went from 0.8 ms to 3.4 ms per row. The only change was from NaNs to zeros.
This picture shows an initial run with a zero-filled frame (see timestats 0.0038), then a run with a NaN-filled frame (timestats 0.0008).
Can anyone provide insight into why it might take so much longer to write to fields of [0.0, 0.0, 0.0, 0.0, 0, 0, 0, 0] instead of [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN]?
Any thoughts on code improvements are also welcome. :)
Thanks
EDIT +17 hours
Following the questions from @BrenBarn, I built a simpler model that anyone can run without my data. In doing so I eliminated the question of whether the NaNs were the cause: in this version I wrote 0.0s to both frames and the difference was the same:
So, unless I'm mistaken (always possible), it seems that writing to a dataframe with 4 columns of floats and 4 of ints takes 10x as long. Is this an issue with pandas, or just what one should expect?
Here's the test code, and here is the output picture.
I think that having an array of 350,000 rows by 8 columns before you add to it makes a significant difference. My initial tests adding to 10 rows showed no impact, so I must go back and retest them.
EDIT +10 minutes
No: I went back and created the initial array with only 10 rows, and the impact on the add loops didn't change, so it's not the size of the original array/dataframe. It's probable that in my earlier test I thought I'd converted the columns to ints but hadn't; checking this proved that the command I thought would do the conversion didn't.
da = SierraFrame(range(10), np.zeros((10,8)))
da.extend_frame1()
EDIT and Possible Answer +35 minutes
In case this question isn't answered in more detail:
At this point, my hypothesis is that the underlying machinery for writing [1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8] to a spare row in the dataframe is different when the dataframe is all one type than when it mixes columns of floats and ints. I just tested it with all int64s and the average write was 0.41 ms, vs 0.37 ms for all floats and 2.8 ms for a mixed dataframe; int8s took 0.39 ms. I guess the mix limits pandas' ability to optimize the operation, so if efficiency is very important, a dataframe with all columns of the same type (probably float64) is the best bet.
Tests conducted on Linux x64 with Python 3.3.1
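For anyone who wants to reproduce this without my data or test code, a rough sketch along these lines shows the same pattern (timings will vary by machine and pandas version):

import time
import numpy as np
import pandas as pd

def ms_per_write(df, n=2000):
    """Average time (ms) to write one row in place with iloc."""
    row = [1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8]
    start = time.perf_counter()
    for i in range(n):
        df.iloc[i] = row
    return (time.perf_counter() - start) / n * 1000

base = np.zeros((10000, 8))
all_float = pd.DataFrame(base.copy())                      # one float64 block
all_int = pd.DataFrame(base.astype('int64'))               # one int64 block
mixed = pd.DataFrame(base.copy())
mixed[[4, 5, 6, 7]] = mixed[[4, 5, 6, 7]].astype('int64')  # float64 block + int64 block

for name, df in [('all float64', all_float), ('all int64', all_int), ('mixed', mixed)]:
    print(name, round(ms_per_write(df), 3), 'ms per row write')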
If you want to do mathematical operations like a dot product or calculating a mean, pandas DataFrames are generally going to be slower than a NumPy array, since pandas does a lot more work such as aligning labels and dealing with heterogeneous data.
The concat function is 50 times faster than the DataFrame.append version. With repeated append calls, a new DataFrame is created at each iteration, and the underlying data is copied each time.
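In other words, accumulate the new rows first and build or concatenate the DataFrame once, rather than appending inside the loop. A rough illustration of the pattern (the bars here are made-up placeholder rows):

import pandas as pd

# Example rows (in real use these would be the incoming bars).
bars = [{'O': 1.0, 'H': 2.0, 'L': 0.5, 'C': 1.5} for _ in range(1000)]

# Slow pattern: df = df.append(bar_df) inside the loop copies all existing
# data on every iteration (DataFrame.append was removed in pandas 2.0).

# Fast pattern: collect plain Python objects, build the DataFrame once.
df = pd.DataFrame(bars)

# Or, if each piece already is a DataFrame, concatenate once at the end.
pieces = [pd.DataFrame([bar]) for bar in bars]
df = pd.concat(pieces, ignore_index=True)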
Pandas keeps track of data types and indexes and performs error checking, all of which is very useful but also slows down calculations. NumPy doesn't do any of that, so it can perform the same calculations significantly faster.
As described in this blog post by the main author of pandas, a pandas DataFrame is internally made up of "blocks". A block is a group of columns all having the same datatype. Each block is stored as a numpy array of its block type. So if you have five int columns and then five float columns, there will be an int block and a float block.
Appending to a multi-dtype DataFrame requires appending to each of the underlying numpy arrays. Appending to a numpy array is slow, because it requires creating a whole new array. So it makes sense that appending to a multi-dtype DataFrame is slow: if all the columns are one type, only one new numpy array has to be created, but if they're different types, multiple new numpy arrays have to be created.
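You can see the block split directly. The dtype counts are public API; the block count below uses private attributes whose names differ across pandas versions (older versions expose ._data rather than ._mgr), so treat it as an illustration only:

import numpy as np
import pandas as pd

homog = pd.DataFrame(np.zeros((5, 8)))                     # 8 float64 columns
mixed = homog.copy()
mixed[[4, 5, 6, 7]] = mixed[[4, 5, 6, 7]].astype('int64')  # 4 float64 + 4 int64 columns

print(homog.dtypes.value_counts())  # float64: 8 -> stored as a single block
print(mixed.dtypes.value_counts())  # float64: 4, int64: 4 -> two blocks

# Private, version-dependent internals:
print(len(homog._mgr.blocks))  # expected: 1
print(len(mixed._mgr.blocks))  # expected: 2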
It is true that keeping the data all the same type will speed this up. However, I would say the main conclusion is not "if efficiency is important, keep all your columns the same type". The conclusion is that if efficiency is important, do not try to append to your arrays/DataFrames.
This is just how numpy works. The slowest part of working with numpy arrays is creating them in the first place. They have a fixed size, and when you "append" to one, you really are just creating an entirely new one with the new size, which is slow. If you absolutely must append to them, you can try stuff like messing with types to ease the pain somewhat. But ultimately you just have to accept that any time you try to append to a DataFrame (or a numpy array in general), you will likely suffer a substantial performance hit.
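The same point at the NumPy level, for reference: "appending" always allocates a brand-new array and copies the data, while writing into a preallocated array does not. A small sketch:

import numpy as np

a = np.zeros((4, 8))
b = np.append(a, [np.arange(8)], axis=0)  # allocates a new (5, 8) array and copies `a` into it
print(b is a, a.shape, b.shape)           # False (4, 8) (5, 8)

# Preallocating once and filling rows in place avoids the repeated reallocation:
buf = np.zeros((5000, 8))
for i in range(100):
    buf[i] = np.arange(8)                 # in-place write, no new array created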