<p>Suppose I have a Pandas Series <code>s</code> whose values sum to <code>1</code> and whose values are also all greater than or equal to <code>0</code>. I need to subtract a constant from all values such that the sum of the new Series is equal to <code>0.6</code>. The catch is, when I subtract this constant, the values never end up less than zero.</p> <p>In math formula, assume I have a series of <code>x</code>'s and I want to find <code>k</code></p> <p><img src="https://i.stack.imgur.com/LphIb.png" alt="enter image description here"></p> <h3>MCVE</h3> <pre class="prettyprint"><code>import pandas as pd import numpy as np from string import ascii_uppercase np.random.seed([3, 141592653]) s = np.power( 1000, pd.Series( np.random.rand(10), list(ascii_uppercase[:10]) ) ).pipe(lambda s: s / s.sum()) s A 0.001352 B 0.163135 C 0.088365 D 0.010904 E 0.007615 F 0.407947 G 0.005856 H 0.198381 I 0.027455 J 0.088989 dtype: float64 </code></pre> <p>The sum is <code>1</code></p> <pre class="prettyprint"><code>s.sum() 0.99999999999999989 </code></pre> <h3>What I've tried</h3> <p>I can use Newton's method (among others) found in Scipy's <code>optimize</code> module</p> <pre class="prettyprint"><code>from scipy.optimize import newton def f(k): return s.sub(k).clip(0).sum() - .6 </code></pre> <p>Finding the root of this function will give me the <code>k</code> I need</p> <pre class="prettyprint"><code>initial_guess = .1 k = newton(f, x0=initial_guess) </code></pre> <p>Then subtract this from <code>s</code></p> <pre class="prettyprint"><code>new_s = s.sub(k).clip(0) new_s A 0.000000 B 0.093772 C 0.019002 D 0.000000 E 0.000000 F 0.338583 G 0.000000 H 0.129017 I 0.000000 J 0.019626 dtype: float64 </code></pre> <p>And the new sum is</p> <pre class="prettyprint"><code>new_s.sum() 0.60000000000000009 </code></pre> <h3>Question</h3> <p>Can we find <code>k</code> without resorting to using a solver?</p>

<p>I was not expecting <code>newton</code> to carry the day. But on large arrays, it does.</p> <h3><code>numba.njit</code></h3> <p>Inspire by Thierry's Answer<br> Using a loop on a sorted array with <code>numba</code>s <code>jit</code></p> <pre class="prettyprint"><code>import numpy as np from numba import njit @njit def find_k_numba(a, t): a = np.sort(a) m = len(a) s = a.sum() to_remove = s - t if to_remove <= 0: k = 0 else: for i, x in enumerate(a): k = to_remove / (m - i) if k < x: break else: to_remove -= x return k </code></pre> <hr> <h3><code>numpy</code></h3> <p>Inspired by Paul's Answer<br> Numpy carrying the heavy lifting. </p> <pre class="prettyprint"><code>import numpy as np def find_k_numpy(a, t): a = np.sort(a) m = len(a) s = a.sum() to_remove = s - t if to_remove <= 0: k = 0 else: c = a.cumsum() n = np.arange(m)[::-1] b = n * a + c i = np.searchsorted(b, to_remove) k = a[i] + (to_remove - b[i]) / (m - i) return k </code></pre> <hr> <h3><code>scipy.optimize.newton</code></h3> <p>My method via Newton</p> <pre class="prettyprint"><code>import numpy as np from scipy.optimize import newton def find_k_newton(a, t): s = a.sum() if s <= t: k = 0 else: def f(k_): return np.clip(a - k_, 0, a.max()).sum() - t k = newton(f, (s - t) / len(a)) return k </code></pre> <hr> <h3>Time Trials</h3> <pre class="prettyprint"><code>import pandas as pd from timeit import timeit res = pd.DataFrame( np.nan, [10, 30, 100, 300, 1000, 3000, 10000, 30000], 'find_k_newton find_k_numpy find_k_numba'.split() ) for i in res.index: a = np.random.rand(i) t = a.sum() * .6 for j in res.columns: stmt = f'{j}(a, t)' setp = f'from __main__ import {j}, a, t' res.at[i, j] = timeit(stmt, setp, number=200) </code></pre> <h3>Results</h3> <pre class="prettyprint"><code>res.plot(loglog=True) </code></pre> <p><img src="https://i.stack.imgur.com/OyBLG.png" alt="enter image description here"></p> <pre class="prettyprint"><code>res.div(res.min(1), 0) find_k_newton find_k_numpy find_k_numba 10 57.265421 17.552150 1.000000 30 29.221947 9.420263 1.000000 100 16.920463 5.294890 1.000000 300 10.761341 3.037060 1.000000 1000 1.455159 1.033066 1.000000 3000 1.000000 2.076484 2.550152 10000 1.000000 3.798906 4.421955 30000 1.000000 5.551422 6.784594 </code></pre>

Close form solution for finding a root

Tags:

performance

python

pandas

numpy

numba

Suppose I have a Pandas Series s whose values sum to 1 and whose values are also all greater than or equal to 0. I need to subtract a constant from all values such that the sum of the new Series is equal to 0.6. The catch is, when I subtract this constant, the values never end up less than zero.

In math formula, assume I have a series of x's and I want to find k

enter image description here

MCVE

Click to copy

import pandas as pd
import numpy as np
from string import ascii_uppercase

np.random.seed([3, 141592653])
s = np.power(
    1000, pd.Series(
        np.random.rand(10),
        list(ascii_uppercase[:10])
    )
).pipe(lambda s: s / s.sum())

s

A    0.001352
B    0.163135
C    0.088365
D    0.010904
E    0.007615
F    0.407947
G    0.005856
H    0.198381
I    0.027455
J    0.088989
dtype: float64

The sum is 1

Click to copy

s.sum()

0.99999999999999989

What I've tried

I can use Newton's method (among others) found in Scipy's optimize module

Click to copy

from scipy.optimize import newton

def f(k):
    return s.sub(k).clip(0).sum() - .6

Finding the root of this function will give me the k I need

Click to copy

initial_guess = .1
k = newton(f, x0=initial_guess)

Then subtract this from s

Click to copy

new_s = s.sub(k).clip(0)
new_s

A    0.000000
B    0.093772
C    0.019002
D    0.000000
E    0.000000
F    0.338583
G    0.000000
H    0.129017
I    0.000000
J    0.019626
dtype: float64

And the new sum is

Click to copy

new_s.sum()

0.60000000000000009

Question

Can we find k without resorting to using a solver?

571

asked Mar 16 '18 06:03

piRSquared

2 Answers

I was not expecting newton to carry the day. But on large arrays, it does.

`numba.njit`

Inspire by Thierry's Answer
Using a loop on a sorted array with numbas jit

Click to copy

import numpy as np
from numba import njit


@njit
def find_k_numba(a, t):
    a = np.sort(a)
    m = len(a)
    s = a.sum()
    to_remove = s - t

    if to_remove <= 0:
        k = 0
    else:
        for i, x in enumerate(a):
            k = to_remove / (m - i)
            if k < x:
                break
            else:
                to_remove -= x
    return k

`numpy`

Inspired by Paul's Answer
Numpy carrying the heavy lifting.

Click to copy

import numpy as np

def find_k_numpy(a, t):
    a = np.sort(a)
    m = len(a)
    s = a.sum()
    to_remove = s - t

    if to_remove <= 0:
        k = 0
    else:
        c = a.cumsum()
        n = np.arange(m)[::-1]
        b = n * a + c
        i = np.searchsorted(b, to_remove)
        k = a[i] + (to_remove - b[i]) / (m - i)
    return k

`scipy.optimize.newton`

My method via Newton

Click to copy

import numpy as np
from scipy.optimize import newton


def find_k_newton(a, t):
    s = a.sum()
    if s <= t:
        k = 0
    else:
        def f(k_):
            return np.clip(a - k_, 0, a.max()).sum() - t

        k = newton(f, (s - t) / len(a))

    return k

Time Trials

Click to copy

import pandas as pd
from timeit import timeit


res = pd.DataFrame(
    np.nan, [10, 30, 100, 300, 1000, 3000, 10000, 30000],
    'find_k_newton find_k_numpy find_k_numba'.split()
)

for i in res.index:
    a = np.random.rand(i)
    t = a.sum() * .6
    for j in res.columns:
        stmt = f'{j}(a, t)'
        setp = f'from __main__ import {j}, a, t'
        res.at[i, j] = timeit(stmt, setp, number=200)

Results

Click to copy

res.plot(loglog=True)

enter image description here

Click to copy

res.div(res.min(1), 0)

       find_k_newton  find_k_numpy  find_k_numba
10         57.265421     17.552150      1.000000
30         29.221947      9.420263      1.000000
100        16.920463      5.294890      1.000000
300        10.761341      3.037060      1.000000
1000        1.455159      1.033066      1.000000
3000        1.000000      2.076484      2.550152
10000       1.000000      3.798906      4.421955
30000       1.000000      5.551422      6.784594

188

answered Sep 28 '22 08:09

piRSquared

Updated: Three different implementations - interestingly, the least sophisticated scales best.

Click to copy

import numpy as np

def f_sort(A, target=0.6):
    B = np.sort(A)
    C = np.cumsum(np.r_[B[0], np.diff(B)] * np.arange(N, 0, -1))
    idx = np.searchsorted(C, 1 - target)
    return B[idx] + (1 - target - C[idx]) / (N-idx)

def f_partition(A, target=0.6):
    target, l = 1 - target, len(A)
    while len(A) > 1:
        m = len(A) // 2
        A = np.partition(A, m-1)
        ls = A[:m].sum()
        if ls + A[m-1] * (l-m) > target:
            A = A[:m]
        else:
            l -= m
            target -= ls
            A = A[m:]
    return target / l            

def f_direct(A, target=0.6):
    target = 1 - target
    while True:
        gt = A > target / len(A)
        if np.all(gt):
            return target / len(A)
        target -= A[~gt].sum()
        A = A[gt]

N = 10
A = np.random.random(N)
A /= A.sum()

print(f_sort(A), np.clip(A-f_sort(A), 0, None).sum())
print(f_partition(A), np.clip(A-f_partition(A), 0, None).sum())
print(f_direct(A), np.clip(A-f_direct(A), 0, None).sum())

from timeit import timeit
kwds = dict(globals=globals(), number=1000)

N = 100000
A = np.random.random(N)
A /= A.sum()

print(timeit('f_sort(A)', **kwds))
print(timeit('f_partition(A)', **kwds))
print(timeit('f_direct(A)', **kwds))

Sample run:

Click to copy

0.04813686999999732 0.5999999999999999
0.048136869999997306 0.6000000000000001
0.048136869999997306 0.6000000000000001
8.38109541599988
2.1064437470049597
1.2743922089866828

answered Sep 28 '22 09:09

Paul Panzer

Related questions
                            
                                AttributeError: module 'django.contrib.auth.views' has no attribute 'login'
                            
                                Pandas: categorize column values by range
                            
                                Train test split without using scikit learn
                            
                                AttributeError: 'numpy.float64' object has no attribute 'log10'
                            
                                How to save changes in read-only Jupyter Notebook
                            
                                Changing the colour of text automatically inserted into tkinter widget
                            
                                How can I properly use a Pandas Dataframe with a multiindex that includes Intervals?
                            
                                Boxplot from dictionary with different length
                            
                                Concat column name with data of first row, Python 3.6 Dataframe
                            
                                Certificate Verification Failure for youtube-dl
                            
                                How to use the Alpha Vantage API directly from Python
                            
                                404 status code while making HTTP request via Python's "requests" library. However page is loading fine in browser
                            
                                ImportError: Cannot import site module and its dependencies: No module named site
                            
                                Plotting single data point using seaborn
                            
                                Python - What does it mean for a Class to be callable?
                            
                                Python Arrow milliseconds
                            
                                Measure the height of a string in Tkinter Python?
                            
                                Efficient computation of similarity matrix in Python (NumPy)
                            
                                Drop multiple columns that end with certain string in Pandas
                            
                                Inorder Traversal of Tree in Python Returning a List

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Close form solution for finding a root

Tags:

performance

python

pandas

numpy

numba

MCVE

What I've tried

Question

piRSquared

People also ask

2 Answers

`numba.njit`

`numpy`

`scipy.optimize.newton`

Time Trials

Results

piRSquared

Paul Panzer

Recent Activity

Donate For Us