Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to replicate scipy.stats.fit using optimization function?

I am trying to fit a distribution to some values. This is my code

from __future__ import print_function
import pandas as pd
import numpy as np
import scipy as sp
import scipy.optimize as opt
import scipy.stats
import matplotlib.pyplot as plt

    values = np.random.pareto(1.5, 10000)
    loc = values.min()
    scale = 1


    def cost_function(alpha):
        cost = -sp.stats.pareto(alpha, loc=loc, scale=scale).pdf(values)
        return cost.sum()

    opt_res = opt.fmin(cost_function, 1.5)

    alpha_fit_v = sp.stats.pareto.fit(values, floc=loc, fscale=scale)


    print('opt_res = ', opt_res,
          ' alpha_fit_v = ', alpha_fit_v)

I was expecting alpha_fit_v to be equivalent to opt_res but it is not. What's wrong?.

like image 256
Donbeo Avatar asked May 26 '26 04:05

Donbeo


1 Answers

What's wrong?.

  1. The cost function is wrong.
  2. np.random.pareto has a different distribution than sp.stats.pareto

1. The cost function is wrong

It does not make sense to sum inverse probabilities. You need to use the logarithm:

def cost_function(alpha):
    cost = -sp.stats.pareto(alpha, loc=loc, scale=scale).logpdf(values)
    return cost.sum()

2. np.random.pareto has a different distribution than sp.stats.pareto

This one is tricky, but you may have noticed that not even sp.stats.pareto.fit returns the correct result. This is because scipy's Pareto distribution cannot fit the data generated by numpy.

import matplpotlib.pyplot as plt
import scipys as sp
import numpy as np

plt.subplot(2, 1, 1)
plt.hist(np.random.pareto(1.5, 10000), 1000)  # This is a Lomax or Pareto II distribution
plt.xlim(0, 10)

plt.subplot(2, 1, 2)
plt.hist(sp.stats.pareto.rvs(1.5, size=1000), 1000)  # This is a Pareto distribution
plt.xlim(0, 10)

distributions

That said, this will work as expected:

values = sp.stats.pareto.rvs(1.5, size=1000)
loc = 0
scale = 1

def cost_function(alpha):
    cost = -sp.stats.pareto(alpha, loc=loc, scale=scale).logpdf(values)
    return cost.sum()

opt_res = opt.fmin(cost_function, 1.5)

alpha_fit_v = sp.stats.pareto.fit(values, floc=loc, fscale=scale)

print('opt_res = ', opt_res,
      ' alpha_fit_v = ', alpha_fit_v)

# opt_res =  [ 1.49611816]  alpha_fit_v =  (1.4960937500000013, 0, 1)

According to the documentation numpy.random.pareto does not quite draw from the Pareto distribution:

Draw samples from a Pareto II or Lomax distribution with specified shape.

The Lomax or Pareto II distribution is a shifted Pareto distribution. The classical Pareto distribution can be obtained from the Lomax distribution by adding 1 and multiplying by the scale parameter m (see Notes).

So you have two alternatives if using numpy to generate the data:

  • You can set loc=-1 for the scipy distribution.
  • You can do values = np.random.pareto(1.5, 10000) + 1 and set loc=0.
like image 121
MB-F Avatar answered May 28 '26 17:05

MB-F