Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

using pandas and numpy to parametrize stack overflow's number of users and reputation

I noticed that Stack Overflow's number of users and their reputation follows an interesting distribution. I created a pandas DF to see if I could create a parametric fit:

import pandas as pd
import numpy as np
soDF = pd.read_excel('scores.xls')
print soDF

Which returns this:

    total_rep    users
0           1  4364226
1         200   269110
2         500   158824
3        1000    90368
4        2000    48609
5        3000    32604
6        5000    18921
7       10000     8618
8       25000     2802
9       50000     1000
10     100000      334

If I graph this, I obtain the following chart:

stack overflow users and reputation

The distribution seems to follow a Power Law. So to better visualize it, I added the following:

soDF['log_total_rep'] = soDF['total_rep'].apply(np.log10)
soDF['log_users'] = soDF['users'].apply(np.log10)
soDF.plot(x='log_total_rep', y='log_users')

Which produced the following: Stack Overflow users and reputation follows a power law

Is there an easy way to use pandas to find the best fit to this data? Although the fit looks linear, perhaps a polynomial fit is better since now I am dealing in logarithmic scales.

like image 521
Luis Miguel Avatar asked Dec 26 '15 16:12

Luis Miguel


People also ask

Is Panda faster than NP?

Pandas is more user-friendly, but NumPy is faster. Pandas has a lot more options for handling missing data, but NumPy has better performance on large datasets. Pandas uses Python objects internally, making it easier to work with than NumPy (which uses C arrays).

How do you convert a DataFrame to a matrix in python?

To convert Pandas DataFrame to Numpy Array, use the function DataFrame. to_numpy() . to_numpy() is applied on this DataFrame and the method returns object of type Numpy ndarray. Usually the returned ndarray is 2-dimensional.


1 Answers

python, pandas, and scipy, oh my!

The scientific python ecosystem has several complimentary libraries. No one library does everything, by design. pandas provides tools to manipulate table-like data and timeseries. However, it deliberately doesn't include the type of functionality you're looking for.

For fitting statistical distributions, you'd typically use another package such as scipy.stats.

However, in this case, we don't have the "raw" data (i.e. a long sequence of reputation scores). Instead we have something similar to a histogram. Therefore, we'll need to fit things at a bit lower level than scipy.stats.powerlaw.fit.


Stand-alone example

For the moment, let's drop pandas entirely. There aren't any advantages to using it here, and we'd quickly wind up converting the dataframe to other data structures anyway. pandas is great, it's just overkill for this situation.

As a quick stand-alone example to reproduce your plot:

import matplotlib.pyplot as plt

total_rep = [1, 200, 500, 1000, 2000, 3000, 5000, 10000,
             25000, 50000, 100000]
num_users = [4364226, 269110, 158824, 90368, 48609, 32604, 
             18921, 8618, 2802, 1000, 334]

fig, ax = plt.subplots()
ax.loglog(total_rep, num_users)
ax.set(xlabel='Total Reputation', ylabel='Number of Users',
       title='Log-Log Plot of Stackoverflow Reputation')
plt.show()

enter image description here


What does this data represent?

Next, we need to know what we're working with. What we've plotted is similar to a histogram, in that it's raw counts of the number of users at a given reputation level. However, note the little "+" beside each bin the reputation table. This means that, for example, 2082 users have a reputation score of 25000 or greater.

Our data is basically an estimate of the complimentary cumulative distribution function (CCDF), in the same sense that a histogram is an estimate of the probability distribution function (PDF). We'll just need to normalize it by the total number of users in our sample to get an estimate of the CCDF. In this case, we can simply divide by the first element of num_users. Reputation can never be less than 1, so 1 on the x-axis corresponds to a probability of 1 by definition. (In other cases, we'd need to estimate this number.) As an example:

import numpy as np
import matplotlib.pyplot as plt

total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                      8618, 2802, 1000, 334])

ccdf = num_users.astype(float) / num_users.max()

fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', lw=2, marker='o',
          clip_on=False, zorder=10)
ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
       ylabel='Probability that Reputation is Greater than X')
plt.show()

enter image description here

You might wonder why we're converting things over to a "normalized" version. The simplest answer is that it's more useful. It allows us to say something that isn't directly related to our sample size. Tomorrow, the total number of Stackoverflow users (and the numbers at each reputation level) will be different. However, the total probability that any given user has a particular reputation won't have changed significantly. If we want to predict John Skeet's reputation (highest rep. user) when the site hits 5 million registered users, it's much easier to use the probabilities instead of raw counts.

Naive fit of a power-law distribution

Next, let's fit a power-law distribution to the CCDF. Again, if we had the "raw" data in the form of a long list of reputation scores, it would be best to use a statistical package to handle this. In particular, scipy.stats.powerlaw.fit.

However, we don't have the raw data. The CCDF of a power-law distribution takes the form of ccdf = x**(-a + 1). Therefore, we'll fit a line in log-space, and we can get the a parameter of the distribution from a = 1 - slope.

For the moment, let's use np.polyfit to fit the line. We'll need to handle the conversion back and forth from log-space by ourselves:

import numpy as np
import matplotlib.pyplot as plt

total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                      8618, 2802, 1000, 334])

ccdf = num_users.astype(float) / num_users.max()

# Fit a line in log-space
logx = np.log(total_rep)
logy = np.log(ccdf)
params = np.polyfit(logx, logy, 1)
est = np.exp(np.polyval(params, logx))

fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', ls='', marker='o',
          clip_on=False, zorder=10, label='Observations')

ax.plot(total_rep, est, color='salmon', label='Fit', ls='--')

ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
       ylabel='Probability that Reputation is Greater than X')

plt.show()

enter image description here

There's an immediate problem with this fit. Our estimate states that there's a greater than 1 probability that users will have a reputation of 1. That's not possible.

The problem is that we let polyfit choose the best-fit y-intercept for our line. If we take a look a params in our code above, it's the second number:

In [11]: params
Out[11]: array([-0.81938338,  1.15955974])

By definition, the y-intercept should be 1. Instead, the best-fit intercept is about 1.16. We need to fix that number, and only allow the slope to vary in the linear fit.

Fixing the y-intercept in the fit

First off, note that log(1) --> 0. Therefore, we actually want to force the y-intercept in log-space to be 0 instead of 1.

It's easiest to do this using np.linalg.lstsq to solve for things instead of np.polyfit. At any rate, you'd do something similar to:

import numpy as np
import matplotlib.pyplot as plt

total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                      8618, 2802, 1000, 334])

ccdf = num_users.astype(float) / num_users.max()

# Fit a line with a y-intercept of 1 in log-space
logx = np.log(total_rep)
logy = np.log(ccdf)
slope, _, _, _ = np.linalg.lstsq(logx[:,np.newaxis], logy)

params = [slope, 0]
est = np.exp(np.polyval(params, logx))

fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', ls='', marker='o',
          clip_on=False, zorder=10, label='Observations')

ax.plot(total_rep, est, color='salmon', label='Fit', ls='--')

ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
       ylabel='Probability that Reputation is Greater than X')

plt.show()

enter image description here

Hmmm... Now we have a new problem. Our new line doesn't fit our data very well. This is a common problem with power-law distributions.

Use only the "tails" in the fit

In real-life, observed distributions almost never exactly follow a power-law. However, their "long tails" often do. You can see this quite clearly in this dataset. If we were to exclude the first two data points (low-reputation/high-probability), we'd get a very different line and it would be a much better fit to the remaining data.

The fact that only the tail of the distribution follows a power-law explains why we weren't able to fit our data very well when we fixed the y-intercept.

There are a lot of different modified power-law models for what happens near a probability of 1, but they all follow a power-law to the right of some cutoff value. Based on our observed data, it looks like we could fit two lines: One to the right of a reputation of ~1000 and one to the left.

With that in mind, let's forget about the left-hand side of things and focus on the "long tail" on the right. We'll use np.polyfit but exclude the left-most three points from the fit.

import numpy as np
import matplotlib.pyplot as plt

total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                      8618, 2802, 1000, 334])

ccdf = num_users.astype(float) / num_users.max()

# Fit a line in log-space, excluding reputation <= 1000
logx = np.log(total_rep[total_rep > 1000])
logy = np.log(ccdf[total_rep > 1000])

params = np.polyfit(logx, logy, 1)
est = np.exp(np.polyval(params, logx))

fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', ls='', marker='o',
          clip_on=False, zorder=10, label='Observations')

ax.plot(total_rep[total_rep > 1000], est, color='salmon', label='Fit', ls='--')

ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
       ylabel='Probability that Reputation is Greater than X')

plt.show()

enter image description here

Test the different fits

In this case, we have some additional data. Let's see how well each different fit predicts the top 5 user's reputation:

import numpy as np
import matplotlib.pyplot as plt

total_rep = np.array([1, 200, 500, 1000, 2000, 3000, 5000, 10000,
                      25000, 50000, 100000])
num_users = np.array([4364226, 269110, 158824, 90368, 48609, 32604, 18921,
                      8618, 2802, 1000, 334])

top_5_rep = [832131, 632105, 618926, 596889, 576697]
top_5_ccdf = np.array([1, 2, 3, 4, 5], dtype=float) / num_users.max()

ccdf = num_users.astype(float) / num_users.max()

# Previous fits
naive_params = [-0.81938338,  1.15955974]
fixed_intercept_params = [-0.68845134, 0]
long_tail_params = [-1.26172528, 5.24883471]

fits = [naive_params, fixed_intercept_params, long_tail_params]
fit_names = ['Naive Fit', 'Fixed Intercept Fit', 'Long Tail Fit']


fig, ax = plt.subplots()
ax.loglog(total_rep, ccdf, color='lightblue', ls='', marker='o',
          clip_on=False, zorder=10, label='Observations')

# Plot reputation of top 5 users
ax.loglog(top_5_rep, top_5_ccdf, ls='', marker='o', color='darkred',
          zorder=10, label='Top 5 Users')

# Plot different fits
for params, name in zip(fits, fit_names):
    x = [1, 1e7]
    est = np.exp(np.polyval(params, np.log(x)))
    ax.loglog(x, est, label=name, ls='--')

ax.set(xlabel='Reputation', title='CCDF of Stackoverflow Reputation',
       ylabel='Probability that Reputation is Greater than X',
       ylim=[1e-7, 1])
ax.legend()

plt.show()

enter image description here

Wow! They all do a pretty awful job! First off, this is a good reason to use the full series when fitting a distribution instead of just the binned data. However, the root of the problem is that a power-law distribution isn't a very good fit in this case. At first glance, it looks like an exponential distribution might be a better fit, but let's leave that for later.

As an example of how badly the different power-law fits over-predict the low-probability observations (i.e. the users with the highest rep), let's predict Jon Skeet's reputation with each model:

import numpy as np

# Jon Skeet's actual reputation
skeet_prob = 1.0 / 4364226
true_rep = 832131

# Previous fits
naive_params = [-0.81938338,  1.15955974]
fixed_intercept_params = [-0.68845134, 0]
long_tail_params = [-1.26172528, 5.24883471]

fits = [naive_params, fixed_intercept_params, long_tail_params]
fit_names = ['Naive Fit', 'Fixed Intercept Fit', 'Long Tail Fit']

for params, name in zip(fits, fit_names):
    inv_params = [1 / params[0], -params[1]/params[0]]
    est = np.exp(np.polyval(inv_params, np.log(skeet_prob)))

    print '{}:'.format(name)
    print '    Pred. Rep.: {}'.format(est)
    print ''

print 'True Reputation: {}'.format(true_rep)

This yields:

Naive Fit:
    Pred. Rep.: 522562573.099

Fixed Intercept Fit:
    Pred. Rep.: 4412664023.88

Long Tail Fit:
    Pred. Rep.: 11728612.2783

True Reputation: 832131
like image 130
Joe Kington Avatar answered Oct 14 '22 20:10

Joe Kington