
Pandas DataFrame resampling based on column criteria

I want to resample a dataframe only where the cell in another column matches my criteria.

df = pd.DataFrame({
        'timestamp': [
            '2013-03-01 08:01:00', '2013-03-01 08:02:00',
            '2013-03-01 08:03:00', '2013-03-01 08:04:00',
            '2013-03-01 08:05:00', '2013-03-01 08:06:00'
        ],
        'Kind': [
            'A', 'B', 'A', 'B', 'A', 'B'
        ],
        'Values': [1, 1.5, 2, 3, 5, 3]
    })

For every timestamp I may have 2-10 kinds, and I want to resample correctly without producing NaN. Currently I resample the entire dataframe using the code below and get NaNs. I think this is because I have multiple entries for certain timestamps.

df.set_index('timestamp').resample('5Min').mean()

One method is to create a separate dataframe for every kind, resample each one, and join the results. I'd like to find out if there is a simpler way of doing it.
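For reference, the manual approach described here might look roughly like the sketch below. It assumes the `timestamp` column is converted to datetime first and uses the default binning of a current pandas release; the names `indexed`, `pieces` and `joined` are hypothetical and only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": [
        "2013-03-01 08:01:00", "2013-03-01 08:02:00",
        "2013-03-01 08:03:00", "2013-03-01 08:04:00",
        "2013-03-01 08:05:00", "2013-03-01 08:06:00",
    ],
    "Kind": ["A", "B", "A", "B", "A", "B"],
    "Values": [1, 1.5, 2, 3, 5, 3],
})
df["timestamp"] = pd.to_datetime(df["timestamp"])
indexed = df.set_index("timestamp")

# Resample each Kind on its own, keeping one Series per Kind
pieces = {}
for kind, part in indexed.groupby("Kind"):
    pieces[kind] = part["Values"].resample("5min").mean()

# Join the per-Kind results side by side, one column per Kind
joined = pd.concat(pieces, axis=1)
print(joined)
```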

yusica asked Jan 12 '17 18:01


1 Answer

After defining your dataframe as you stated, first convert the timestamp column to datetime. Then set it as the index, and finally resample and take the mean, as follows:

import pandas as pd
df = pd.DataFrame({
        'timestamp': [
            '2013-03-01 08:01:00', '2013-03-01 08:02:00',
            '2013-03-01 08:03:00', '2013-03-01 08:04:00',
            '2013-03-01 08:05:00', '2013-03-01 08:06:00'
        ],
        'Kind': [
            'A', 'B', 'A', 'B', 'A', 'B'
        ],
        'Values': [1, 1.5, 2, 3, 5, 3]
    })

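# Convert the string timestamps to datetime and use them as the index
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.set_index("timestamp")
# Resample only the numeric column; closed/label are set to "right" to match
# the output shown below (current pandas defaults to "left" for minute bins)
df = df[["Values"]].resample("5min", closed="right", label="right").mean()
print(df.mean())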

This would print the mean you expect:

>>> 
Values    2.75

And the resampled dataframe would be:

>>> df
                     Values
timestamp                  
2013-03-01 08:05:00     2.5
2013-03-01 08:10:00     3.0
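
The labels 08:05 and 08:10 in this output correspond to right-closed, right-labelled bins. Current pandas releases default to left-closed, left-labelled bins for minute frequencies, so the same call may give different bins depending on the version. A minimal sketch comparing the two, with `closed` and `label` set explicitly (the names `values`, `left` and `right` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2013-03-01 08:01:00", "2013-03-01 08:02:00",
        "2013-03-01 08:03:00", "2013-03-01 08:04:00",
        "2013-03-01 08:05:00", "2013-03-01 08:06:00",
    ]),
    "Kind": ["A", "B", "A", "B", "A", "B"],
    "Values": [1, 1.5, 2, 3, 5, 3],
})
values = df.set_index("timestamp")[["Values"]]

# Left-closed bins [08:00, 08:05) and [08:05, 08:10), labelled by their left edge
left = values.resample("5min", closed="left", label="left").mean()

# Right-closed bins (08:00, 08:05] and (08:05, 08:10], labelled by their right edge
right = values.resample("5min", closed="right", label="right").mean()

print(left)
print(right)
```

With the left-closed variant the 08:05 row falls into the second bin, which is why the two results differ.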

Grouping by kind

If you want to group by Kind and get the mean of each Kind (that is, of A and B), you can do as follows:

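# Starting again from the original df defined above
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.set_index("timestamp")
gb = df.groupby("Kind")["Values"]
# Resample each Kind separately, with right-closed, right-labelled bins
# to match the output shown below
df = gb.resample("5min", closed="right", label="right").mean().to_frame()
print(df.xs("A", level="Kind").mean())
print(df.xs("B", level="Kind").mean())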

As a result you would get:

>>> 
Values    2.666667
Values    2.625

And your dataframe would finally look like this:

>>> df
                            Values
Kind timestamp                    
A    2013-03-01 08:05:00  2.666667
B    2013-03-01 08:05:00  2.250000
     2013-03-01 08:10:00  3.000000
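
If one column per Kind is more convenient than a MultiIndex, the grouped result can be pivoted with `unstack`. This is a sketch that assumes the same right-closed binning as the output above; the name `wide` is hypothetical. Note that a Kind with no rows in a given bin ends up as NaN in that cell after unstacking.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": [
        "2013-03-01 08:01:00", "2013-03-01 08:02:00",
        "2013-03-01 08:03:00", "2013-03-01 08:04:00",
        "2013-03-01 08:05:00", "2013-03-01 08:06:00",
    ],
    "Kind": ["A", "B", "A", "B", "A", "B"],
    "Values": [1, 1.5, 2, 3, 5, 3],
})
df["timestamp"] = pd.to_datetime(df["timestamp"])

wide = (
    df.set_index("timestamp")
      .groupby("Kind")["Values"]                        # one group per Kind
      .resample("5min", closed="right", label="right")  # 5-minute bins within each group
      .mean()
      .unstack("Kind")                                  # move Kind from the index to columns
)
print(wide)
```

Here Kind A has no sample after 08:05, so its 08:10 cell is NaN while Kind B has a value there.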

Cedric Zoppolo answered Sep 26 '22 00:09