I noticed that Stack Overflow's number of users and their reputation follows an interesting distribution. I created a pandas DF to see if I could create a parametric fit:
import pandas as pd
import numpy as np
soDF = pd.read_excel('scores.xls')
print soDF
Which returns this:
total_rep users
0 1 4364226
1 200 269110
2 500 158824
3 1000 90368
4 2000 48609
5 3000 32604
6 5000 18921
7 10000 8618
8 25000 2802
9 50000 1000
10 100000 334
If I graph this, I obtain the following chart:
The distribution seems to follow a Power Law. So to better visualize it, I added the following:
soDF['log_total_rep'] = soDF['total_rep'].apply(np.log10)
soDF['log_users'] = soDF['users'].apply(np.log10)
soDF.plot(x='log_total_rep', y='log_users')
Which produced the following:
Is there an easy way to use pandas to find the best fit to this data? Although the fit looks linear, perhaps a polynomial fit is better since now I am dealing in logarithmic scales.
Pandas is more user-friendly, but NumPy is faster. Pandas has a lot more options for handling missing data, but NumPy has better performance on large datasets. Pandas uses Python objects internally, making it easier to work with than NumPy (which uses C arrays).
To convert Pandas DataFrame to Numpy Array, use the function DataFrame. to_numpy() . to_numpy() is applied on this DataFrame and the method returns object of type Numpy ndarray. Usually the returned ndarray is 2-dimensional.
python
, pandas
, and scipy
, oh my!The scientific python ecosystem has several complimentary libraries. No one library does everything, by design. pandas
provides tools to manipulate table-like data and timeseries. However, it deliberately doesn't include the type of functionality you're looking for.
For fitting statistical distributions, you'd typically use another package such as scipy.stats
.
However, in this case, we don't have the "raw" data (i.e. a long sequence of reputation scores). Instead we have something similar to a histogram. Therefore, we'll need to fit things at a bit lower level than scipy.stats.powerlaw.fit
.
For the moment, let's drop pandas
entirely. There aren't any advantages to using it here, and we'd quickly wind up converting the dataframe to other data structures anyway. pandas
is great, it's just overkill for this situation.
As a quick stand-alone example to reproduce your plot:
import matplotlib.pyplot as plt
total_rep = [1, 200, 500, 1000, 2000, 3000, 5000, 10000,
25000, 50000, 100000]
num_users = [4364226, 269110, 158824, 90368, 48609, 32604,
18921, 8618, 2802, 1000, 334]
fig, ax = plt.subplots()
ax.loglog(total_rep, num_users)
ax.set(xlabel='Total Reputation', ylabel='Number of Users',
title='Log-Log Plot of Stackoverflow Reputation')
plt.show()
Next, we need to know what we're working with. What we've plotted is similar to a histogram, in that it's raw counts of the number of users at a given reputation level. However, note the little "+" beside each bin the reputation table. This means that, for example, 2082 users have a reputation score of 25000 or greater.
Our data is basically an estimate of the complimentary cumulative distribution function (CCDF), in the same sense that a histogram is an estimate of the probability distribution function (PDF). We'll just need to normalize it by the total number of users in our sample to get an estimate of the CCDF. In this case, we can simply divide by the first element of num_users
. Reputation can never be less than 1, so 1 on the x-axis corresponds to a probability of 1 by definition. (In other cases, we'd need to estimate this number.) As an example:
import numpy as np
import matplotlib.pyplot as plt
total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
8618, 2802, 1000, 334])
ccdf = num_users.astype(float) / num_users.max()
fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', lw=2, marker='o',
clip_on=False, zorder=10)
ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
ylabel='Probability that Reputation is Greater than X')
plt.show()
You might wonder why we're converting things over to a "normalized" version. The simplest answer is that it's more useful. It allows us to say something that isn't directly related to our sample size. Tomorrow, the total number of Stackoverflow users (and the numbers at each reputation level) will be different. However, the total probability that any given user has a particular reputation won't have changed significantly. If we want to predict John Skeet's reputation (highest rep. user) when the site hits 5 million registered users, it's much easier to use the probabilities instead of raw counts.
Next, let's fit a power-law distribution to the CCDF. Again, if we had the "raw" data in the form of a long list of reputation scores, it would be best to use a statistical package to handle this. In particular, scipy.stats.powerlaw.fit
.
However, we don't have the raw data. The CCDF of a power-law distribution takes the form of ccdf = x**(-a + 1)
. Therefore, we'll fit a line in log-space, and we can get the a
parameter of the distribution from a = 1 - slope
.
For the moment, let's use np.polyfit
to fit the line. We'll need to handle the conversion back and forth from log-space by ourselves:
import numpy as np
import matplotlib.pyplot as plt
total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
8618, 2802, 1000, 334])
ccdf = num_users.astype(float) / num_users.max()
# Fit a line in log-space
logx = np.log(total_rep)
logy = np.log(ccdf)
params = np.polyfit(logx, logy, 1)
est = np.exp(np.polyval(params, logx))
fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', ls='', marker='o',
clip_on=False, zorder=10, label='Observations')
ax.plot(total_rep, est, color='salmon', label='Fit', ls='--')
ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
ylabel='Probability that Reputation is Greater than X')
plt.show()
There's an immediate problem with this fit. Our estimate states that there's a greater than 1 probability that users will have a reputation of 1. That's not possible.
The problem is that we let polyfit
choose the best-fit y-intercept for our line. If we take a look a params
in our code above, it's the second number:
In [11]: params
Out[11]: array([-0.81938338, 1.15955974])
By definition, the y-intercept should be 1. Instead, the best-fit intercept is about 1.16
. We need to fix that number, and only allow the slope to vary in the linear fit.
First off, note that log(1) --> 0
. Therefore, we actually want to force the y-intercept in log-space to be 0 instead of 1.
It's easiest to do this using np.linalg.lstsq
to solve for things instead of np.polyfit
. At any rate, you'd do something similar to:
import numpy as np
import matplotlib.pyplot as plt
total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
8618, 2802, 1000, 334])
ccdf = num_users.astype(float) / num_users.max()
# Fit a line with a y-intercept of 1 in log-space
logx = np.log(total_rep)
logy = np.log(ccdf)
slope, _, _, _ = np.linalg.lstsq(logx[:,np.newaxis], logy)
params = [slope, 0]
est = np.exp(np.polyval(params, logx))
fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', ls='', marker='o',
clip_on=False, zorder=10, label='Observations')
ax.plot(total_rep, est, color='salmon', label='Fit', ls='--')
ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
ylabel='Probability that Reputation is Greater than X')
plt.show()
Hmmm... Now we have a new problem. Our new line doesn't fit our data very well. This is a common problem with power-law distributions.
In real-life, observed distributions almost never exactly follow a power-law. However, their "long tails" often do. You can see this quite clearly in this dataset. If we were to exclude the first two data points (low-reputation/high-probability), we'd get a very different line and it would be a much better fit to the remaining data.
The fact that only the tail of the distribution follows a power-law explains why we weren't able to fit our data very well when we fixed the y-intercept.
There are a lot of different modified power-law models for what happens near a probability of 1, but they all follow a power-law to the right of some cutoff value. Based on our observed data, it looks like we could fit two lines: One to the right of a reputation of ~1000 and one to the left.
With that in mind, let's forget about the left-hand side of things and focus on the "long tail" on the right. We'll use np.polyfit
but exclude the left-most three points from the fit.
import numpy as np
import matplotlib.pyplot as plt
total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
8618, 2802, 1000, 334])
ccdf = num_users.astype(float) / num_users.max()
# Fit a line in log-space, excluding reputation <= 1000
logx = np.log(total_rep[total_rep > 1000])
logy = np.log(ccdf[total_rep > 1000])
params = np.polyfit(logx, logy, 1)
est = np.exp(np.polyval(params, logx))
fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', ls='', marker='o',
clip_on=False, zorder=10, label='Observations')
ax.plot(total_rep[total_rep > 1000], est, color='salmon', label='Fit', ls='--')
ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
ylabel='Probability that Reputation is Greater than X')
plt.show()
In this case, we have some additional data. Let's see how well each different fit predicts the top 5 user's reputation:
import numpy as np
import matplotlib.pyplot as plt
total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
8618, 2802, 1000, 334])
top_5_rep = [832131, 632105, 618926, 596889, 576697]
top_5_ccdf = np.array([1, 2, 3, 4, 5], dtype=float) / num_users.max()
ccdf = num_users.astype(float) / num_users.max()
# Previous fits
naive_params = [-0.81938338, 1.15955974]
fixed_intercept_params = [-0.68845134, 0]
long_tail_params = [-1.26172528, 5.24883471]
fits = [naive_params, fixed_intercept_params, long_tail_params]
fit_names = ['Naive Fit', 'Fixed Intercept Fit', 'Long Tail Fit']
fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', ls='', marker='o',
clip_on=False, zorder=10, label='Observations')
# Plot reputation of top 5 users
ax.loglog(top_5_rep, top_5_ccdf, ls='', marker='o', color='darkred',
zorder=10, label='Top 5 Users')
# Plot different fits
for params, name in zip(fits, fit_names):
x = [1, 1e7]
est = np.exp(np.polyval(params, np.log(x)))
ax.loglog(x, est, label=name, ls='--')
ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
ylabel='Probability that Reputation is Greater than X',
ylim=[1e-7, 1])
ax.legend()
plt.show()
Wow! They all do a pretty awful job! First off, this is a good reason to use the full series when fitting a distribution instead of just the binned data. However, the root of the problem is that a power-law distribution isn't a very good fit in this case. At first glance, it looks like an exponential distribution might be a better fit, but let's leave that for later.
As an example of how badly the different power-law fits over-predict the low-probability observations (i.e. the users with the highest rep), let's predict Jon Skeet's reputation with each model:
import numpy as np
# Jon Skeet's actual reputation
skeet_prob = 1.0 / 4364226
true_rep = 832131
# Previous fits
naive_params = [-0.81938338, 1.15955974]
fixed_intercept_params = [-0.68845134, 0]
long_tail_params = [-1.26172528, 5.24883471]
fits = [naive_params, fixed_intercept_params, long_tail_params]
fit_names = ['Naive Fit', 'Fixed Intercept Fit', 'Long Tail Fit']
for params, name in zip(fits, fit_names):
inv_params = [1 / params[0], -params[1]/params[0]]
est = np.exp(np.polyval(inv_params, np.log(skeet_prob)))
print '{}:'.format(name)
print ' Pred. Rep.: {}'.format(est)
print ''
print 'True Reputation: {}'.format(true_rep)
This yields:
Naive Fit:
Pred. Rep.: 522562573.099
Fixed Intercept Fit:
Pred. Rep.: 4412664023.88
Long Tail Fit:
Pred. Rep.: 11728612.2783
True Reputation: 832131
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With