 

How to specify the number of rows a pandas dataframe will have?

I have a Pandas dataframe and I am continually appending a row of data each second as below.

df.loc[time.strftime("%Y-%m-%d %H:%M:%S")] = [reading1, reading2, reading3]
>>>df
                     sensor1 sensor2 sensor3
2015-04-14 08:50:23    5.4     5.6     5.7
2015-04-14 08:50:24    5.5     5.6     5.8
2015-04-14 08:50:26    5.2     5.3     5.4

If I continue this, eventually I am going to start experiencing memory issues (each append operates on the whole DataFrame).

I only need to keep X rows of the data. i.e. after the operation, it will be:

>>>df
                     sensor1 sensor2 sensor3
(this row is gone)
2015-04-14 08:50:24    5.5     5.6     5.8
2015-04-14 08:50:26    5.2     5.3     5.4
2015-04-14 08:50:27    5.2     5.4     5.6

Is there a way I can specify a maximum number of rows, so that whenever a new row is added the oldest row is dropped at the same time, WITHOUT having to do "check the length of the DataFrame; if it is greater than X, remove the first row; append the new row"?

Like this, but for a Pandas DataFrame: https://stackoverflow.com/a/10155753/4783578
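
For reference, the straightforward version I want to avoid would look roughly like this (just a sketch, with X as the row limit):

# Naive approach: check the length, drop the oldest row, then append
if len(df) >= X:
    df = df.iloc[1:]
df.loc[time.strftime("%Y-%m-%d %H:%M:%S")] = [reading1, reading2, reading3]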

asked Apr 13 '15 by ps.george

2 Answers

One way would be to pre-allocate the rows, and replace the values cyclically.

import numpy as np
import pandas as pd

# Say we want to limit the DataFrame to a thousand rows
N = 1000

# Create the DataFrame with N rows and 5 columns -- all NaNs
data = pd.DataFrame(np.full((N, 5), np.nan))

# To check the length of the DataFrame, we'll need to .dropna().
len(data.dropna())              # Returns 0

# Keep a running counter of the next index to insert into
counter = 0

# Insertion always happens at that counter
data.loc[counter, :] = np.random.rand(5)

# ... and increment the counter, wrapping it back to 0 once it reaches N
counter = (counter + 1) % N

# Now, the DataFrame contains one row
len(data.dropna())              # Returns 1

# We can add several rows one after another. Let's add twice as many as N
for row in np.random.rand(2 * N, 5):
    data.loc[counter, :] = row
    counter = (counter + 1) % N

# Now that we have added them, we still have only the last N rows
len(data)                       # Returns N

This avoids reallocating or copying the DataFrame in any way, and is a fast way to append the data. However, reading from the data can be slower if:

  • The order of the data matters. If you need the rows in insertion order, you need to slice data using counter to restore the original order (see the sketch after this list).
  • The number of rows is small. If you end up appending fewer rows than N, you'll need .dropna() (or count the total inserted rows) to remove the unused ones.
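
For instance, a minimal sketch of recovering insertion order from the cyclic buffer above, reusing the data and counter variables from the example (rows from counter onwards are the oldest):

# Rotate the buffer so the oldest rows come first, then drop any unused slots
ordered = pd.concat([data.iloc[counter:], data.iloc[:counter]]).dropna()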

In most of the scenarios I deal with where truncated-append performance matters, neither of the above is true, but your scenario may be different. In that case, @Alexander has a good solution involving .shift().
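
For completeness, a rough sketch of what a shift-based rolling append could look like (my own illustration here, not a reproduction of that answer), assuming a plain integer index with the timestamp kept in an ordinary column:

import time

import numpy as np
import pandas as pd

N = 1000
df = pd.DataFrame({'time': [None] * N,
                   'sensor1': np.nan, 'sensor2': np.nan, 'sensor3': np.nan})

def rolling_append(df, reading1, reading2, reading3):
    df = df.shift(-1)                       # the oldest row falls off the top; the last row becomes NaN
    df.iloc[-1] = [time.strftime("%Y-%m-%d %H:%M:%S"),
                   reading1, reading2, reading3]   # overwrite the freed slot with the new reading
    return df

df = rolling_append(df, 5.4, 5.6, 5.7)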

answered Sep 21 '22 by S Anand


pandas stores data in arrays. Doing the sort of operation you want inherently requires a copy for an array data structure. Since data is stored in contiguous (or strided) memory, adding something to the end and removing something from the beginning requires copying everything to a new region of memory. There is no way around this. You need to use a different data structure.

Edit: Thinking about this a bit more, I see two approaches to do this.

The simplest and most straightforward would be to use a collections.deque. You can just append a new row to the end, and once the deque is full it will automatically drop the corresponding row from the beginning. At the end, you can convert it into a DataFrame. I am only using the for loop as an example; I gather you get your data a different way, but that wouldn't matter:

import time
from collections import deque

import pandas as pd

maxlen = 1000

dq = deque(maxlen=maxlen)

# `readings` stands in for however you receive your data:
# an iterable of (reading1, reading2, reading3) tuples
for reading1, reading2, reading3 in readings:
    dq.append(pd.Series([reading1, reading2, reading3],
                        index=['sensor1', 'sensor2', 'sensor3'],
                        name=time.strftime("%Y-%m-%d %H:%M:%S")))

df = pd.concat(dq, axis=1).T

The second approach is to use a DataFrame of a fixed size, using the row number modulo the maximum length to choose the slot to overwrite, while also keeping the item number in the DataFrame. You can then sort by item number. In your case you could conceivably sort by time instead, but this approach is more general. As with the previous example, I will use a for loop to demonstrate, but you probably don't have one. Further, I will assume that you don't have a real iterable you can enumerate; if you do, then you don't have to keep track of the index number as I do here:

import time

import numpy as np
import pandas as pd

maxlen = 1000

# dtype=object so the DataFrame can hold the timestamp string alongside the floats
df = pd.DataFrame(np.full((maxlen, 5), np.nan, dtype=object),
                  columns=['index', 'time',
                           'sensor1', 'sensor2', 'sensor3'])

i = 0
for reading1, reading2, reading3 in readings:
    df.loc[i % maxlen, :] = [i, time.strftime("%Y-%m-%d %H:%M:%S"),
                             reading1, reading2, reading3]
    i += 1

df.sort_values('index', inplace=True)
del df['index']
df.set_index('time', drop=True, inplace=True)
answered Sep 24 '22 by TheBlackCat