I am using Python's deque() to implement a simple circular buffer:
from collections import deque
import numpy as np

test_sequence = np.array(list(range(100)) * 2).reshape(100, 2)
mybuffer = deque(np.zeros(20).reshape((10, 2)))

for i in test_sequence:
    mybuffer.popleft()
    mybuffer.append(i)
    do_something_on(mybuffer)  # placeholder for the actual processing
I was wondering if there's a simple way of obtaining the same thing in pandas using a Series (or DataFrame). In other words, how can I efficiently add a single row at the end and remove a single row at the beginning of a Series or DataFrame?
Edit: I tried this:
import pandas as pd

myPandasBuffer = pd.DataFrame(columns=('A', 'B'), data=np.zeros(20).reshape((10, 2)))
newpoint = pd.DataFrame(columns=('A', 'B'), data=np.array([[1, 1]]))

for i in test_sequence:
    newpoint[['A', 'B']] = i
    myPandasBuffer = pd.concat([myPandasBuffer.iloc[1:], newpoint], ignore_index=True)
    do_something_on(myPandasBuffer)
But it's painfully slower than the deque() method.
For indexed access a list is O(1) everywhere, so for random access a plain list is the better choice; that is not what deques were designed for. Deques, which are implemented as a doubly-linked list of fixed-size blocks, instead offer O(1) appends and pops at both the right and the left end.
Python's deque was the first data type added to the collections module, back in Python 2.4.
The deque class from the collections module has no peek method, but the same result can be obtained with square brackets: the first element is available as [0] and the last as [-1].
The built-in queue module provides FIFO (first in, first out) queues intended mainly for communication between threads. A deque is double-ended, so depending on which end you append to and pop from, it can act as either a FIFO queue or a LIFO stack; for queue-like behaviour within a single thread, a deque is usually a better fit than a list.
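For illustration, here is a minimal sketch (the variable name buf is purely illustrative) of the index-based peek and of a fixed-length deque acting as a FIFO buffer:

from collections import deque

# A deque created with maxlen behaves as a fixed-size FIFO buffer:
# appending to a full deque silently discards the element at the opposite end.
buf = deque([0, 1, 2, 3, 4], maxlen=5)
buf.append(5)      # 0 is dropped automatically, no explicit popleft() needed

# There is no peek() method, but indexing gives the same result.
print(buf[0])      # oldest element -> 1
print(buf[-1])     # newest element -> 5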
As noted by dorvak, pandas is not designed for queue-like behaviour.
Below I've replicated the simple insert function from deque in pandas dataframes, numpy arrays, and also in hdf5 using the h5py module.
The timeit function reveals (unsurprisingly) that the collections module is much faster, followed by numpy and then pandas.
from collections import deque
import pandas as pd
import numpy as np
import h5py
def insert_deque(test_sequence, buffer_deque):
    # Drop the oldest item and append the new one.
    for item in test_sequence:
        buffer_deque.popleft()
        buffer_deque.append(item)
    return buffer_deque

def insert_df(test_sequence, buffer_df):
    # Shift all rows up by one, then overwrite the last row.
    for item in test_sequence:
        buffer_df.iloc[0:-1, :] = buffer_df.iloc[1:, :].values
        buffer_df.iloc[-1] = item
    return buffer_df

def insert_arraylike(test_sequence, buffer_arr):
    # Same shift-and-overwrite, for numpy arrays and h5py datasets.
    for item in test_sequence:
        buffer_arr[:-1] = buffer_arr[1:]
        buffer_arr[-1] = item
    return buffer_arr
test_sequence = np.array(list(range(100))*2).reshape(100,2)
# create buffer arrays
nested_list = [[0]*2]*5
buffer_deque = deque(nested_list)
buffer_df = pd.DataFrame(nested_list, columns=('A','B'))
buffer_arr = np.array(nested_list)
# calculate speed of each process in ipython
print("deque : ")
%timeit insert_deque(test_sequence, buffer_deque)
print("pandas : ")
%timeit insert_df(test_sequence, buffer_df)
print("numpy array : ")
%timeit insert_arraylike(test_sequence, buffer_arr)
print("hdf5 with h5py : ")
with h5py.File("h5py_test.h5", "w") as f:
    f["buffer_hdf5"] = np.array(nested_list)
    %timeit insert_arraylike(test_sequence, f["buffer_hdf5"])
The %timeit results:
deque : 34.1 µs per loop
pandas : 48 ms per loop
numpy array : 187 µs per loop
hdf5 with h5py : 31.7 ms per loop
Notes:
My pandas slicing method was only slightly faster than the concat method listed in the question.
The hdf5 format (via h5py) did not show any advantages. I also don't see any advantages of HDFStore, as suggested by Andy.
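As a side note (this is only a sketch and was not part of the timings above): if pandas is only needed for whatever do_something_on does with the buffer, one option is to keep the buffer itself as a fixed-length deque and build a DataFrame from it only when a pandas view is actually required. Constructing the frame is not free, so this pays off only when the conversion happens far less often than the inserts.

from collections import deque
import numpy as np
import pandas as pd

test_sequence = np.array(list(range(100)) * 2).reshape(100, 2)

# maxlen makes the deque a fixed-size FIFO: append() discards the oldest row.
buffer_deque = deque(np.zeros(20).reshape((10, 2)), maxlen=10)

for row in test_sequence:
    buffer_deque.append(row)

# Convert to a DataFrame only when a pandas view of the buffer is needed.
frame = pd.DataFrame(list(buffer_deque), columns=('A', 'B'))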