
Python Data Frame: cumulative sum of column until condition is reached and return the index

I am new to Python and am currently facing an issue I can't solve. I really hope you can help me out. English is not my native language, so I am sorry if I do not express myself properly.

Say I have a simple data frame with two columns:

index  Num_Albums  Num_authors
0      10          4
1      1           5
2      4           4
3      7           1000
4      1           44
5      3           8

Num_Albums_tot = sum(Num_Albums) = 30

I need to compute a cumulative sum of the data in Num_Albums until a certain condition is reached, record the index at which the condition is met, and get the corresponding value from Num_authors.

Example: cumulative sum of Num_Albums until the sum equals 50% ± 1/15 of 30 (--> 15±2):

10 = 15±2? No, then continue;
10+1 = 15±2? No, then continue;
10+1+4 = 15±2? Yes, stop.

Condition reached at index 2. Then get Num_authors at that index: Num_authors(2) = 4

I would like to see if there's a function already implemented in pandas, before I start thinking how to do it with a while/for loop....

[I would like to specify the column from which I want to retrieve the value at the relevant index. This comes in handy when I have, e.g., 4 columns and I want to sum the elements in column 1 and, once the condition is met, get the corresponding value in column 2; then do the same with columns 3 and 4.]

AMaz asked Jan 05 '17 15:01



2 Answers

Opt - 1:

You could compute the cumulative sum using cumsum, then use np.isclose with its built-in absolute tolerance parameter (atol) to check whether each value in this series lies within the specified threshold of 15 +/- 2. This returns a boolean array.

np.flatnonzero then returns the integer positions at which the condition is True. We select the first of them.

Finally, use .iloc to retrieve the value from the column you need at the position computed above.

import numpy as np

# position of the first cumulative sum within 15 +/- 2
val = np.flatnonzero(np.isclose(df.Num_Albums.cumsum().values, 15, atol=2))[0]
df['Num_authors'].iloc[val]      # for faster scalar access, use .iat
4

For reference, np.isclose applied to the cumulative sum (converted to an array) gives:

np.isclose(df.Num_Albums.cumsum().values, 15, atol=2)
array([False, False,  True, False, False, False], dtype=bool)
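The question also asks for a way to choose which column is summed and which column the value is read from. A minimal sketch of that, wrapping the steps above in a function; the name value_at_cumsum_target and its arguments are hypothetical, and rtol=0 is set so that only the absolute tolerance applies (np.isclose otherwise adds a small relative tolerance on top of atol):

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

def value_at_cumsum_target(frame, sum_col, value_col, target, atol):
    """Hypothetical helper: return (position, value) for the first row whose
    running total of sum_col is within atol of target, or None if no row is."""
    running = frame[sum_col].cumsum().to_numpy()
    # rtol=0: compare with the absolute tolerance only.
    hits = np.flatnonzero(np.isclose(running, target, rtol=0, atol=atol))
    if hits.size == 0:
        return None
    pos = int(hits[0])
    return pos, frame[value_col].iat[pos]

result = value_at_cumsum_target(df, "Num_Albums", "Num_authors", target=15, atol=2)
no_match = value_at_cumsum_target(df, "Num_Albums", "Num_authors", target=100, atol=2)
```

With the sample data, result holds position 2 and the value 4, and no_match is None. For other column pairs, pass different sum_col and value_col names.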

Opt - 2:

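Use pd.Index.get_indexer on the cumulative sum, which supports a tolerance parameter with the nearest method. (pd.Index.get_loc used to accept the same arguments, but they were removed in pandas 2.0.)

val = pd.Index(df.Num_Albums.cumsum()).get_indexer([15], method='nearest', tolerance=2)[0]
df['Num_authors'].iat[val]      # DataFrame.get_value was removed in pandas 1.0
4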
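A self-contained sketch of this kind of lookup using Index.get_indexer, which takes method and tolerance arguments and returns -1 when nothing lies within the tolerance. Two things to keep in mind: the nearest method needs a monotonic index (true here only because all Num_Albums values are positive), and it picks the closest cumulative sum rather than the first one inside the tolerance band. The variable names are illustrative:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

running = pd.Index(df["Num_Albums"].cumsum())  # 10, 11, 15, 22, 23, 26

# Position of the cumulative sum closest to 15, if within +/- 2.
pos = running.get_indexer([15], method="nearest", tolerance=2)[0]

# No cumulative sum is within +/- 2 of 100, so get_indexer reports -1.
missing = running.get_indexer([100], method="nearest", tolerance=2)[0]

# Guard against -1 before using the position.
author = df["Num_authors"].iat[pos] if pos != -1 else None

# "nearest" is not "first within tolerance": for a target of 12,
# 10 (position 0) is inside the band, but 11 (position 1) is closer.
closest_to_12 = running.get_indexer([12], method="nearest", tolerance=2)[0]
```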

Opt - 3:

Use idxmax to find the index label of the first True value in the boolean mask created by the sub, abs and le operations on the cumsum series (df.at replaces DataFrame.get_value, which was removed in pandas 1.0):

df.at[df.Num_Albums.cumsum().sub(15).abs().le(2).idxmax(), 'Num_authors']
4
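One caveat with idxmax on a boolean mask: if no value is True, it still returns the first index label, which looks like a match. A small sketch that guards against this with any(); the helper name first_label_within is hypothetical:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

def first_label_within(series, target, tol):
    """Hypothetical helper: index label of the first cumulative sum within
    tol of target, or None when no cumulative sum qualifies."""
    mask = series.cumsum().sub(target).abs().le(tol)
    return mask.idxmax() if mask.any() else None

label = first_label_within(df["Num_Albums"], 15, 2)
no_match = first_label_within(df["Num_Albums"], 100, 2)

# Without the guard, an all-False mask silently yields the first label.
unguarded = df["Num_Albums"].cumsum().sub(100).abs().le(2).idxmax()

author = df.at[label, "Num_authors"] if label is not None else None
```

Here label is 2 and author is 4, while unguarded is 0 even though no cumulative sum is anywhere near 100.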
Nickil Maveli answered Oct 29 '22 22:10


I think you can directly add a column with the cumulative sum as:

In [3]: df
Out[3]: 
   index  Num_Albums  Num_authors
0      0          10            4
1      1           1            5
2      2           4            4
3      3           7         1000
4      4           1           44
5      5           3            8

In [4]: df['cumsum'] = df['Num_Albums'].cumsum()

In [5]: df
Out[5]: 
   index  Num_Albums  Num_authors  cumsum
0      0          10            4      10
1      1           1            5      11
2      2           4            4      15
3      3           7         1000      22
4      4           1           44      23
5      5           3            8      26

Then apply the condition you want on the cumsum column. For instance, you can use where to get the full row that matches the filter, after setting the tolerance tol. Note that where followed by dropna upcasts the integer columns to float, as the output shows:

In [18]: tol = 2

In [19]: cond = df.where((df['cumsum']>=15-tol)&(df['cumsum']<=15+tol)).dropna()

In [20]: cond
Out[20]: 
   index  Num_Albums  Num_authors  cumsum
2    2.0         4.0          4.0    15.0
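As a variation on the same idea, plain boolean indexing keeps the original integer dtypes, because no NaN rows are created in between. A sketch using Series.between, which is inclusive on both ends by default; the question's extra index column is omitted and the variable names are illustrative:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

tol = 2
df["cumsum"] = df["Num_Albums"].cumsum()

# Keep only the rows whose running total lies in [15 - tol, 15 + tol].
hits = df[df["cumsum"].between(15 - tol, 15 + tol)]

# First matching row's author count, or None if nothing matched.
first_author = hits["Num_authors"].iloc[0] if not hits.empty else None
```

With the sample data, hits contains only the row labelled 2, and its columns remain integers.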
Fabio Lamanna answered Oct 29 '22 23:10