 

Pandas DataFrame: How to natively get minimum across range of rows and columns

I have a Pandas DataFrame that looks similar to this but with 10,000 rows and 500 columns.

[image: My Dataframe]

For each row, I would like to find the minimum value between 3 days ago at 15:00 and today at 13:30.

Is there a native numpy way to do this quickly? My goal is to get the minimum value for each row by asking something like "what is the minimum value from 3 days ago at 15:00 to 0 days ago (i.e. today) at 13:30?"

For this particular example the answers for the last two rows would be:

2011-01-09 2481.22
2011-01-10 2481.22

My current way is this:

1. Get the earliest row (only the values after the start time)
2. Get the middle rows 
3. Get the last row (only the values before the end time)
4. Concat (1), (2), and (3)
5. Get the minimum of (4)

But this takes a very long time on a large DataFrame.
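For reference, the five steps above might look roughly like the sketch below. The original code was not posted, so this is an assumed reconstruction: the helper name window_min_concat is hypothetical, the toy frame mirrors the 10 x 8 example used in this question, and the start day is assumed to be strictly earlier than the end day.

```python
import datetime

import numpy
import pandas

# Toy frame: 10 days x 8 half-hourly columns (assumed layout).
numpy.random.seed(0)
times = [datetime.time(h, m) for h in range(13, 17) for m in (0, 30)]
df = pandas.DataFrame(
    (numpy.random.rand(10, 8) * 100 + 2000).astype(int),
    columns=times,
    index=pandas.date_range('2011-01-01', '2011-01-10'),
)


def window_min_concat(df, start_day, start_time, end_day, end_time):
    """Hypothetical helper: min from (start_day, start_time) to (end_day, end_time)."""
    first = df.loc[start_day, start_time:]           # (1) earliest row, from start time on
    middle = df.loc[start_day:end_day].iloc[1:-1]    # (2) full rows strictly in between
    last = df.loc[end_day, :end_time]                # (3) last row, up to end time
    combined = pandas.concat(                        # (4) glue the three pieces together
        [first, pandas.Series(middle.to_numpy().ravel()), last],
        ignore_index=True,
    )
    return combined.min()                            # (5) minimum of everything


result = window_min_concat(
    df,
    pandas.Timestamp('2011-01-07'), datetime.time(15, 0),
    pandas.Timestamp('2011-01-10'), datetime.time(13, 30),
)
print(result)
```

Calling this once per row means building and concatenating new objects 10,000 times, which is likely where the time goes.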


The following code will generate a similar DF:

import numpy
import pandas
import datetime

numpy.random.seed(0)

# 10 days x 8 half-hourly readings, values in [2000, 2100)
random_numbers = numpy.random.rand(10, 8) * 100 + 2000
columns = [
    datetime.time(13, 0), datetime.time(13, 30),
    datetime.time(14, 0), datetime.time(14, 30),
    datetime.time(15, 0), datetime.time(15, 30),
    datetime.time(16, 0), datetime.time(16, 30),
]
index = pandas.date_range('2011/1/1', '2011/1/10')
df = pandas.DataFrame(data=random_numbers, columns=columns, index=index).astype(int)

print(df)

Here is the json version of the dataframe:

'{"13:00:00":{"1293840000000":2085,"1293926400000":2062,"1294012800000":2035,"1294099200000":2086,"1294185600000":2006,"1294272000000":2097,"1294358400000":2078,"1294444800000":2055,"1294531200000":2023,"1294617600000":2024},"13:30:00":{"1293840000000":2045,"1293926400000":2039,"1294012800000":2035,"1294099200000":2045,"1294185600000":2025,"1294272000000":2099,"1294358400000":2028,"1294444800000":2028,"1294531200000":2034,"1294617600000":2010},"14:00:00":{"1293840000000":2095,"1293926400000":2006,"1294012800000":2001,"1294099200000":2032,"1294185600000":2022,"1294272000000":2040,"1294358400000":2024,"1294444800000":2070,"1294531200000":2081,"1294617600000":2095},"14:30:00":{"1293840000000":2057,"1293926400000":2042,"1294012800000":2018,"1294099200000":2023,"1294185600000":2025,"1294272000000":2016,"1294358400000":2066,"1294444800000":2041,"1294531200000":2098,"1294617600000":2023},"15:00:00":{"1293840000000":2082,"1293926400000":2025,"1294012800000":2040,"1294099200000":2061,"1294185600000":2013,"1294272000000":2063,"1294358400000":2024,"1294444800000":2036,"1294531200000":2096,"1294617600000":2068},"15:30:00":{"1293840000000":2090,"1293926400000":2084,"1294012800000":2092,"1294099200000":2003,"1294185600000":2001,"1294272000000":2049,"1294358400000":2066,"1294444800000":2082,"1294531200000":2090,"1294617600000":2005},"16:00:00":{"1293840000000":2081,"1293926400000":2003,"1294012800000":2009,"1294099200000":2001,"1294185600000":2011,"1294272000000":2098,"1294358400000":2051,"1294444800000":2092,"1294531200000":2029,"1294617600000":2073},"16:30:00":{"1293840000000":2015,"1293926400000":2095,"1294012800000":2094,"1294099200000":2042,"1294185600000":2061,"1294272000000":2006,"1294358400000":2042,"1294444800000":2004,"1294531200000":2099,"1294617600000":2088}}'

user1367204 asked Aug 06 '15 22:08



1 Answer

You can first stack the DataFrame to create a series and then index slice it as required and take the min. For example:

first, last = ('2011-01-07', datetime.time(15)), ('2011-01-10', datetime.time(13, 30))
df.stack().loc[first: last].min()

The result of df.stack is a Series with a MultiIndex where the inner level is composed of the original columns. We then slice using tuple pairs with the start and end date and times. If you're going to be doing lots of such operations then you should consider assigning df.stack() to some variable. You might then consider changing the index to a proper DatetimeIndex. Then you can work with both the time series and the grid format as required.
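As a rough sketch of that suggestion (not code from the original answer), the stacked Series can be kept in a variable and its two index levels merged into a single DatetimeIndex. The variable names stacked, days and clock are illustrative, and the toy frame is assumed to match the one generated in the question.

```python
import datetime

import numpy
import pandas

# Toy frame: 10 days x 8 half-hourly columns (assumed layout).
numpy.random.seed(0)
times = [datetime.time(h, m) for h in range(13, 17) for m in (0, 30)]
df = pandas.DataFrame(
    (numpy.random.rand(10, 8) * 100 + 2000).astype(int),
    columns=times,
    index=pandas.date_range('2011-01-01', '2011-01-10'),
)

# Stack once and reuse the result for every query.
stacked = df.stack()

# Merge the (date, time) levels into one timestamp per observation.
days = stacked.index.get_level_values(0)
clock = stacked.index.get_level_values(1)
stacked.index = pandas.DatetimeIndex(
    [datetime.datetime.combine(d, t) for d, t in zip(days, clock)]
)

# Label slicing on a sorted DatetimeIndex includes both endpoints.
start = pandas.Timestamp('2011-01-07 15:00')
end = pandas.Timestamp('2011-01-10 13:30')
window_min = stacked.loc[start:end].min()
print(window_min)
```

The list comprehension is the slow part of the conversion, but it only runs once; every later slice is a plain time-series lookup.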

Here's another method which avoids stacking and, as a one-off, is a lot faster on DataFrames of the size you're actually working with. (Slicing is much faster once the DataFrame has already been stacked, so if you're doing many of these operations you should stack once and convert the index.)
It's less general: it works with min and max but not with, say, mean. It takes the min of the relevant part of the first row, the min of the relevant part of the last row, and the min of the rows in between (if any), then takes the min of these three candidates.

# first and last are the (date, time) tuples defined above
first_row = df.index.get_loc(first[0])
last_row = df.index.get_loc(last[0])
if first_row == last_row:
    # same day: a single slice of one row
    result = df.loc[first[0], first[1]: last[1]].min()
elif first_row < last_row:
    # first day from the start time on, last day up to the end time
    first_row_min = df.loc[first[0], first[1]:].min()
    last_row_min = df.loc[last[0], :last[1]].min()
    # all full days strictly in between (nan if there are none)
    middle_min = df.iloc[first_row + 1:last_row].min().min()
    result = min(first_row_min, last_row_min, middle_min)
else:
    raise ValueError('first row must be <= last row')

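Note that if first_row + 1 == last_row then there are no rows in between and middle_min is nan. The result is still correct as long as middle_min doesn't come first in the call to min: every comparison with nan is False, so Python's min never replaces its running minimum with nan, but it would keep nan if it started with it.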
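To get the value for every row at once, as the question asks, one possible approach (an extension, not part of the method above) is a rolling minimum over the flattened values. It assumes a regular grid with no missing days or NaN values, so that "3 days ago 15:00 to today 13:30" always spans the same number of cells. The names days_back, window and per_row_min are illustrative.

```python
import datetime

import numpy
import pandas

# Toy frame: 10 days x 8 half-hourly columns (assumed layout).
numpy.random.seed(0)
times = [datetime.time(h, m) for h in range(13, 17) for m in (0, 30)]
df = pandas.DataFrame(
    (numpy.random.rand(10, 8) * 100 + 2000).astype(int),
    columns=times,
    index=pandas.date_range('2011-01-01', '2011-01-10'),
)

days_back = 3
n_cols = df.shape[1]
start_col = df.columns.get_loc(datetime.time(15, 0))   # window opens here, days_back rows up
end_col = df.columns.get_loc(datetime.time(13, 30))    # window closes here, in the current row

# Number of cells between the two endpoints, inclusive, in row-major order.
window = days_back * n_cols - start_col + end_col + 1

# Rolling min over the flattened grid; position k holds the min of the
# `window` cells ending at k (NaN until a full window is available).
flat = df.to_numpy().ravel()
rolled = pandas.Series(flat).rolling(window).min().to_numpy()

# Keep only the positions that correspond to (row, end_col).
per_row_min = pandas.Series(rolled[end_col::n_cols], index=df.index)
print(per_row_min)
```

The first days_back rows come out as NaN because there is not enough history for a full window.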

JoeCondron answered Sep 17 '22 23:09
