Python Data Frame: cumulative sum of column until condition is reached and return the index

Tags: python, pandas, dataframe, sum

I am new to Python and am facing an issue I can't solve. I really hope you can help me out. English is not my native language, so I am sorry if I do not express myself properly.

Say I have a simple data frame with two columns:

index  Num_Albums  Num_authors
0      10          4
1      1           5
2      4           4
3      7           1000
4      1           44
5      3           8

Num_Albums_tot = sum(Num_Albums) = 30

I need to do a cumulative sum of the data in Num_Albums until a certain condition is reached. Register the index at which the condition is achieved and get the correspondent value from Num_authors.

Example: cumulative sum of Num_Albums until the sum equals 50% ± 1/15 of 30 (--> 15±2):

10 = 15±2? No, then continue;
10+1 = 15±2? No, then continue;
10+1+4 = 15±2? Yes, stop.

Condition reached at index 2. Then get Num_authors at that index: Num_authors(2) = 4

I would like to see if there's a function already implemented in pandas, before I start thinking how to do it with a while/for loop....

[I would like to specify the column from which I retrieve the value at the relevant index. This comes in handy when I have, e.g., 4 columns: I sum the elements in column 1 and, once the condition is reached, get the corresponding value in column 2; then I do the same with columns 3 and 4.]

asked Jan 05 '17 15:01

AMaz

2 Answers

Opt - 1:

You could compute the cumulative sum using cumsum. Then use np.isclose with its built-in absolute tolerance parameter (atol) to check whether the values in this series lie within the specified threshold of 15 +/- 2. This returns a boolean array.

Through np.flatnonzero, get the integer positions at which the condition is True, and select the first one.

Finally, use .iloc to retrieve the value from the column you need at the position computed above.

import numpy as np

# position of the first cumulative sum within 15 +/- 2
val = np.flatnonzero(np.isclose(df.Num_Albums.cumsum().values, 15, atol=2))[0]
df['Num_authors'].iloc[val]      # for faster access, use .iat
4

For reference, np.isclose applied to the cumulative sums (converted to an array) gives:

np.isclose(df.Num_Albums.cumsum().values, 15, atol=2)
array([False, False,  True, False, False, False], dtype=bool)
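A self-contained sketch of Opt 1 on the question's data may help. The DataFrame construction, the names target, tol, mask, hits and result, and the guard for the no-match case are additions for illustration and not part of the original answer. Note that np.isclose also applies a relative tolerance by default, so rtol=0 is passed here to make the check purely absolute.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

target, tol = 15, 2  # 15 +/- 2, as in the question

# rtol=0 reduces the check to |cumsum - target| <= tol
mask = np.isclose(df["Num_Albums"].cumsum().to_numpy(), target, rtol=0, atol=tol)
hits = np.flatnonzero(mask)  # integer positions where the condition holds

# flatnonzero returns an empty array when nothing matches, so guard before indexing
result = df["Num_authors"].iat[hits[0]] if hits.size else None
print(result)
```

Without the guard, indexing hits[0] raises an IndexError when no cumulative sum falls inside the window.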

Opt - 2:

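Use pd.Index.get_indexer on the cumsum series; it supports a tolerance parameter with the nearest method. (Older pandas versions accepted the same arguments on pd.Index.get_loc, and offered df.get_value for the scalar lookup.)

val = pd.Index(df.Num_Albums.cumsum()).get_indexer([15], method='nearest', tolerance=2)[0]
df['Num_authors'].iat[val]       # val is a position; -1 means no match within the tolerance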
4
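Two caveats of the nearest-match approach can be sketched as follows. The helper name nearest_position and the variables cum, pos_hit and pos_miss are hypothetical. The sketch assumes the summed column is strictly positive, so the cumulative sums are unique and increasing, which the nearest lookup needs. It also returns the closest cumulative sum, which is not necessarily the first one inside the window.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

def nearest_position(cumulative, target, tol):
    """Position of the cumulative sum nearest to target, or None if none is within tol."""
    # get_indexer returns -1 for targets that have no match within the tolerance
    pos = pd.Index(cumulative).get_indexer([target], method="nearest", tolerance=tol)[0]
    return None if pos == -1 else int(pos)

cum = df["Num_Albums"].cumsum()            # 10, 11, 15, 22, 23, 26
pos_hit = nearest_position(cum, 15, 2)     # a cumulative sum of exactly 15 exists
pos_miss = nearest_position(cum, 18, 2)    # nearest sums are 15 and 22, both outside 18 +/- 2

if pos_hit is not None:
    print(df["Num_authors"].iat[pos_hit])
```

Checking for -1 matters because using it directly with .iat or .iloc would silently return the last row.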

Opt - 3:

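Use idxmax to find the index label of the first True value in the boolean mask built with sub, abs and le on the cumsum series. Be aware that idxmax also returns the first label when the mask contains no True at all, and that df.get_value from older pandas versions is replaced by .at here:

df.at[df.Num_Albums.cumsum().sub(15).abs().le(2).idxmax(), 'Num_authors']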
4
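Since the question asks to choose both the summed column and the lookup column, Opt 3 could be wrapped in a small function. This is only a sketch: the function name value_at_first_hit and its parameters are hypothetical, and it assumes the DataFrame has a unique index so that .at returns a scalar.

```python
import pandas as pd

def value_at_first_hit(df, sum_col, value_col, target, tol):
    """Value of value_col at the first row where cumsum(sum_col) is within target +/- tol."""
    mask = df[sum_col].cumsum().sub(target).abs().le(tol)
    if not mask.any():
        # idxmax would return the first label here, so report "no match" explicitly
        return None
    return df.at[mask.idxmax(), value_col]

# Sample data from the question
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

found = value_at_first_hit(df, "Num_Albums", "Num_authors", target=15, tol=2)
missing = value_at_first_hit(df, "Num_Albums", "Num_authors", target=100, tol=2)
print(found, missing)
```

With four columns, the same function can be called once per pair of column names, for example once for columns 1 and 2 and once for columns 3 and 4.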

answered Oct 29 '22 22:10

Nickil Maveli

I think you can directly add a column with the cumulative sum as:

In [3]: df
Out[3]: 
   index  Num_Albums  Num_authors
0      0          10            4
1      1           1            5
2      2           4            4
3      3           7         1000
4      4           1           44
5      5           3            8

In [4]: df['cumsum'] = df['Num_Albums'].cumsum()

In [5]: df
Out[5]: 
   index  Num_Albums  Num_authors  cumsum
0      0          10            4      10
1      1           1            5      11
2      2           4            4      15
3      3           7         1000      22
4      4           1           44      23
5      5           3            8      26

And then apply the condition you want on the cumsum column. For instance, you can use where to keep only the rows that satisfy the filter (where followed by dropna converts the values to float, as the output shows). Setting the tolerance tol:

In [18]: tol = 2

In [19]: cond = df.where((df['cumsum']>=15-tol)&(df['cumsum']<=15+tol)).dropna()

In [20]: cond
Out[20]: 
   index  Num_Albums  Num_authors  cumsum
2    2.0         4.0          4.0    15.0
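As a possible variation on this answer, boolean indexing with .loc selects the same rows without the conversion to float. The names in_range, rows and first_authors are illustrative, and the extra index column from the session above is omitted.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})
df["cumsum"] = df["Num_Albums"].cumsum()

tol = 2
# between() includes both endpoints by default: 15 - tol <= cumsum <= 15 + tol
in_range = df["cumsum"].between(15 - tol, 15 + tol)
rows = df.loc[in_range]  # integer columns keep their dtype

# Take the first matching row, if there is one
first_authors = rows["Num_authors"].iloc[0] if not rows.empty else None
print(rows)
print(first_authors)
```

If several rows fall inside the window, rows contains all of them in their original order, so .iloc[0] picks the earliest.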

answered Oct 29 '22 23:10

Fabio Lamanna
