I have a Pandas DataFrame that looks similar to this but with 10,000 rows and 500 columns.
For each row, I would like to find the minimum value between 3 days ago at 15:00 and today at 13:30.
Is there some native numpy way to do this quickly? My goal is to be able to get the minimum value for each row by saying something like "what is the minimum value from 3 days ago ago 15:00 to 0 days ago (aka today) 13:30?"
For this particular example the answers for the last two rows would be:
2011-01-09 2481.22
2011-01-10 2481.22
My current way is this:
1. Get the earliest row (only the values after the start time)
2. Get the middle rows
3. Get the last row (only the values before the end time)
4. Concat (1), (2), and (3)
5. Get the minimum of (4)
But this takes a very long time on a large DataFrame
The following code will generate a similar DF:
import numpy
import pandas
import datetime
numpy.random.seed(0)
random_numbers = (numpy.random.rand(10, 8)*100 + 2000)
columns = [datetime.time(13,0) , datetime.time(13,30), datetime.time(14,0), datetime.time(14,30) , datetime.time(15,0), datetime.time(15,30) ,datetime.time(16,0), datetime.time(16,30)]
index = pandas.date_range('2011/1/1', '2011/1/10')
df = pandas.DataFrame(data = random_numbers, columns=columns, index = index).astype(int)
print df
Here is the json version of the dataframe:
'{"13:00:00":{"1293840000000":2085,"1293926400000":2062,"1294012800000":2035,"1294099200000":2086,"1294185600000":2006,"1294272000000":2097,"1294358400000":2078,"1294444800000":2055,"1294531200000":2023,"1294617600000":2024},"13:30:00":{"1293840000000":2045,"1293926400000":2039,"1294012800000":2035,"1294099200000":2045,"1294185600000":2025,"1294272000000":2099,"1294358400000":2028,"1294444800000":2028,"1294531200000":2034,"1294617600000":2010},"14:00:00":{"1293840000000":2095,"1293926400000":2006,"1294012800000":2001,"1294099200000":2032,"1294185600000":2022,"1294272000000":2040,"1294358400000":2024,"1294444800000":2070,"1294531200000":2081,"1294617600000":2095},"14:30:00":{"1293840000000":2057,"1293926400000":2042,"1294012800000":2018,"1294099200000":2023,"1294185600000":2025,"1294272000000":2016,"1294358400000":2066,"1294444800000":2041,"1294531200000":2098,"1294617600000":2023},"15:00:00":{"1293840000000":2082,"1293926400000":2025,"1294012800000":2040,"1294099200000":2061,"1294185600000":2013,"1294272000000":2063,"1294358400000":2024,"1294444800000":2036,"1294531200000":2096,"1294617600000":2068},"15:30:00":{"1293840000000":2090,"1293926400000":2084,"1294012800000":2092,"1294099200000":2003,"1294185600000":2001,"1294272000000":2049,"1294358400000":2066,"1294444800000":2082,"1294531200000":2090,"1294617600000":2005},"16:00:00":{"1293840000000":2081,"1293926400000":2003,"1294012800000":2009,"1294099200000":2001,"1294185600000":2011,"1294272000000":2098,"1294358400000":2051,"1294444800000":2092,"1294531200000":2029,"1294617600000":2073},"16:30:00":{"1293840000000":2015,"1293926400000":2095,"1294012800000":2094,"1294099200000":2042,"1294185600000":2061,"1294272000000":2006,"1294358400000":2042,"1294444800000":2004,"1294531200000":2099,"1294617600000":2088}}'
You can first stack the DataFrame to create a series and then index slice it as required and take the min. For example:
first, last = ('2011-01-07', datetime.time(15)), ('2011-01-10', datetime.time(13, 30))
df.stack().loc[first: last].min()
The result of df.stack
is a Series
with a MultiIndex
where the inner level is composed of the original columns. We then slice using tuple
pairs with the start and end date and times.
If you're going to be doing lots of such operations then you should consider assigning df.stack()
to some variable. You might then consider changing the index to a proper DatetimeIndex
. Then you can work with both the time series and the grid format as required.
Here's another method which avoids stacking and is a lot faster on DataFrames of the size you're actually working with (as a one-off; slicing the stacked DataFrame
is a lot faster once it's stacked so if you're doing many of these operations you should stack and convert the index).
It's less general as it works with min
and max
but not with, say, mean
. It gets the min
of the subset of the first and last rows and the min
of the rows in between (if any) and takes the min
of these three candidates.
first_row = df.index.get_loc(first[0])
last_row = df.index.get_loc(last[0])
if first_row == last_row:
result = df.loc[first[0], first[1]: last[1]].min()
elif first_row < last_row:
first_row_min = df.loc[first[0], first[1]:].min()
last_row_min = df.loc[last[0], :last[1]].min()
middle_min = df.iloc[first_row + 1:last_row].min().min()
result = min(first_row_min, last_row_min, middle_min)
else:
raise ValueError('first row must be <= last row')
Note that if first_row + 1 == last_row
then middle_min
is nan
but the result is still correct as long as middle_min
doesn't come first in the call to min
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With