
Interval datatype in Pandas - find midpoint, left, center etc

In pandas 0.20.1, with the interval type, is it possible to find the midpoint, left, or center values of the intervals in a Series?

For example:

  1. Create an interval datatype column, and perform some aggregation calculations over these intervals:

    df_Stats = df.groupby(['month',pd.cut(df['Distances'], np.arange(0, 135,1))]).agg(aggregations)
    

This returns df_Stats with an interval-typed column: df['Distances']

  2. Now I want to associate the left end of each interval with the results of these aggregations, using a Series-level operation:

    df['LeftEnd'] = df['Distances'].left
    

That does not work on the whole Series. However, I can run it element-wise:

    df.loc[0]['LeftEnd'] = df.loc[0]['Distances'].left

This works. Thoughts?

Asked by penguin on May 20 '17 at 16:05



2 Answers

pd.cut() actually produces categorical data (a CategoricalIndex once it is used as a groupby level), with an IntervalIndex as the categories.
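As a quick check of that claim, the following minimal sketch bins a small, made-up Series and inspects the result. The variable names (`distances`, `binned`) are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Made-up data for illustration only.
distances = pd.Series([0, 1, 2, 3], name="distances")
binned = pd.cut(distances, 2)

# The binned Series has a categorical dtype...
is_categorical = isinstance(binned.dtype, pd.CategoricalDtype)

# ...and its categories are held in an IntervalIndex.
categories = binned.cat.categories
has_interval_categories = isinstance(categories, pd.IntervalIndex)

print(binned.dtype)
print(is_categorical, has_interval_categories)
```

This is why `.left` is not available on the column directly: the interval attributes live on the categories (or on an IntervalIndex built from the values), not on the categorical Series itself.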

In [13]: df = pd.DataFrame({'month': [1, 1, 2, 2], 'distances': range(4), 'value': range(4)})

In [14]: df
Out[14]: 
   distances  month  value
0          0      1      0
1          1      1      1
2          2      2      2
3          3      2      3

In [15]: result = df.groupby(['month', pd.cut(df.distances, 2)]).value.mean()

In [16]: result
Out[16]: 
month  distances    
1      (-0.003, 1.5]    0.5
2      (1.5, 3.0]       2.5
Name: value, dtype: float64

You can simply coerce them to an IntervalIndex (this also works if they are a column), then access the .left, .right, and .mid attributes.

In [17]: pd.IntervalIndex(result.index.get_level_values('distances')).left
Out[17]: Float64Index([-0.003, 1.5], dtype='float64')

In [18]: pd.IntervalIndex(result.index.get_level_values('distances')).right
Out[18]: Float64Index([1.5, 3.0], dtype='float64')

In [19]: pd.IntervalIndex(result.index.get_level_values('distances')).mid
Out[19]: Float64Index([0.7485, 2.25], dtype='float64')
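
For the column case mentioned above, a minimal sketch might look like the following. It reuses the example frame from this answer; the column names `bin`, `left`, `right`, and `mid` are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"month": [1, 1, 2, 2], "distances": range(4), "value": range(4)})

# Store the binned values as an ordinary (categorical) column.
df["bin"] = pd.cut(df["distances"], 2)

# Coerce the column to an IntervalIndex, then read the bounds
# for every row in one step instead of element by element.
intervals = pd.IntervalIndex(df["bin"])
df["left"] = intervals.left
df["right"] = intervals.right
df["mid"] = intervals.mid

print(df)
```

The bounds come back as index objects, which are assigned to the new columns by position, so the row order of the frame is preserved.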
Answered by Jeff on Oct 10 '22 at 21:10


Say 'cut' is the name of the column produced by pd.cut.

Instead of:

    df['LeftEnd'] = df['Distances'].left

use one of the following:

    df['LeftEnd'] = df['cut'].apply(lambda x: x.left)

    df['LeftEnd'] = df['cut'].apply(lambda x: x.left).astype(str)
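
A self-contained sketch of the element-wise approach, using made-up data and bin edges (both are assumptions for illustration). Depending on the pandas version, `apply` on a categorical column may return a categorical result, so this sketch casts to float explicitly rather than relying on the returned dtype.

```python
import pandas as pd

# Hypothetical data; 'cut' holds the result of pd.cut, as in the answer above.
df = pd.DataFrame({"Distances": [0.5, 1.5, 2.5, 3.5]})
df["cut"] = pd.cut(df["Distances"], bins=[0, 1, 2, 3, 4])

# Each element of 'cut' is a pd.Interval, so .left is available per element.
left_ends = df["cut"].apply(lambda interval: interval.left)

# Cast explicitly so the new column holds plain floats, whatever dtype
# apply() returned for the categorical input.
df["LeftEnd"] = left_ends.astype(float)

print(df)
```

Use `.astype(str)` instead of `.astype(float)` if the left ends are needed as text labels.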
Answered by Mahesh Babu J on Oct 10 '22 at 20:10