
generating correlated numbers in numpy / pandas

I’m trying to generate simulated student grades in 4 subjects, where a student record is a single row of data. The code shown here will generate normally distributed random numbers with a mean of 60 and a standard deviation of 15.

df = pd.DataFrame(15 * np.random.randn(5, 4) + 60, columns=['Math', 'Science', 'History', 'Art'])

What I can’t figure out is how to make it so that a student’s Science mark is highly correlated to their Math mark, and that their History and Art marks are less so, but still somewhat correlated to the Math mark.

I’m neither a statistician nor an expert programmer, so a less sophisticated but more easily understood solution is what I’m hoping for.

Soundguy asked Aug 30 '17 05:08



2 Answers

Let's put what has been suggested by @Daniel into code.

Step 1

Let's import multivariate_normal:

import numpy as np
import pandas as pd  # needed for the DataFrame further down
from scipy.stats import multivariate_normal as mvn

Step 2

Let's construct the covariance matrix:

cov = np.array([[1.0, 0.8, 0.7, 0.6], [0.8, 1.0, 0.5, 0.5],
                [0.7, 0.5, 1.0, 0.5], [0.6, 0.5, 0.5, 1.0]])
cov

array([[ 1. ,  0.8,  0.7,  0.6],
       [ 0.8,  1. ,  0.5,  0.5],
       [ 0.7,  0.5,  1. ,  0.5],
       [ 0.6,  0.5,  0.5,  1. ]])

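This is the key step. Note that the covariance matrix has 1's on the diagonal, so every subject has unit variance and the matrix doubles as the correlation matrix. The covariances with Math (the first row) decrease as you step from left to right. With unit variances the generated marks have a standard deviation of 1, not the 15 from the question; multiply the matrix by 15**2 to get that spread.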

Now we are ready to generate data, let's say 1,000 points:

scores = mvn.rvs(mean = [60.,60.,60.,60.], cov=cov, size = 1000)
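
The matrix above has unit variances, so the marks it produces spread by roughly 1 point around 60. One way to recover the standard deviation of 15 from the question is to scale the correlation matrix by the standard deviations. This is a minimal sketch, not part of the original answer: the names corr, std, cov_scaled and scores_scaled, and the random_state seed, are assumptions introduced here.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Correlation matrix from the step above (unit diagonal).
corr = np.array([[1.0, 0.8, 0.7, 0.6],
                 [0.8, 1.0, 0.5, 0.5],
                 [0.7, 0.5, 1.0, 0.5],
                 [0.6, 0.5, 0.5, 1.0]])

# Assumed target standard deviation per subject (15, as in the question).
std = np.array([15.0, 15.0, 15.0, 15.0])

# cov[i, j] = corr[i, j] * std[i] * std[j]
cov_scaled = corr * np.outer(std, std)

# random_state is an arbitrary seed, used only to make the sketch reproducible.
scores_scaled = mvn.rvs(mean=[60.0] * 4, cov=cov_scaled, size=1000, random_state=0)

print(scores_scaled.std(axis=0))  # each value should be close to 15
```

Scaling this way leaves the correlations unchanged; only the spread of each column changes.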

Sanity check (from covariance matrix to simple correlations):

np.corrcoef(scores.T)

array([[ 1.        ,  0.78886583,  0.70198586,  0.56810058],
       [ 0.78886583,  1.        ,  0.49187904,  0.45994833],
       [ 0.70198586,  0.49187904,  1.        ,  0.4755558 ],
       [ 0.56810058,  0.45994833,  0.4755558 ,  1.        ]])

Note that np.corrcoef treats each row as a variable by default, which is why scores is transposed.
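
To illustrate the row/column convention, here is a small sketch on made-up data (the array name data and the seed are assumptions for illustration). Transposing and passing rowvar=False should give the same 4 x 4 result.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 observations (rows) of 4 variables (columns), the same layout as scores.
data = rng.normal(size=(1000, 4))

# Option 1: transpose so that each variable sits in a row.
by_transpose = np.corrcoef(data.T)

# Option 2: tell corrcoef that the variables are in columns.
by_rowvar = np.corrcoef(data, rowvar=False)

print(by_transpose.shape)                    # (4, 4)
print(np.allclose(by_transpose, by_rowvar))  # True
```

Calling np.corrcoef(data) without either adjustment would treat each of the 1000 rows as a variable and return a 1000 x 1000 matrix.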

Finally, let's put the data into a pandas DataFrame:

df = pd.DataFrame(data = scores, columns = ["Math", "Science","History", "Art"])
df.head()

    Math        Science     History     Art
0   60.629673   61.238697   61.805788   61.848049
1   59.728172   60.095608   61.139197   61.610891
2   61.205913   60.812307   60.822623   59.497453
3   60.581532   62.163044   59.277956   60.992206
4   61.408262   59.894078   61.154003   61.730079
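
As a further check, pandas can report the same correlations with column labels through DataFrame.corr(). A minimal sketch, assuming the same matrix as above; the variable names and the random_state seed are choices made here for illustration and reproducibility.

```python
import numpy as np
import pandas as pd
from scipy.stats import multivariate_normal as mvn

cov = np.array([[1.0, 0.8, 0.7, 0.6],
                [0.8, 1.0, 0.5, 0.5],
                [0.7, 0.5, 1.0, 0.5],
                [0.6, 0.5, 0.5, 1.0]])

# Arbitrary seed so that repeated runs give the same sample.
scores = mvn.rvs(mean=[60.0] * 4, cov=cov, size=1000, random_state=1)
df = pd.DataFrame(scores, columns=["Math", "Science", "History", "Art"])

# Pairwise Pearson correlations between all columns, labelled by subject.
corr_table = df.corr()
print(corr_table.round(2))

# Correlation of every subject with Math, strongest first.
print(corr_table["Math"].sort_values(ascending=False))
```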

Step 3

Let's visualize some data that we've just generated:

ax = df.plot(x = "Math",y="Art", kind="scatter", color = "r", alpha = .5, label = "Art, $corr_{Math}$ = .6")
df.plot(x = "Math",y="Science", kind="scatter", ax = ax, color = "b", alpha = .2, label = "Science, $corr_{Math}$ = .8")
ax.set_ylabel("Art and Science");

[Image: scatter plot of Art (red) and Science (blue) marks against Math marks]

Sergey Bushmanov answered Oct 14 '22 22:10


The statistical tool for that is the covariance matrix: https://en.wikipedia.org/wiki/Covariance. Each cell (i, j) represents the dependency between variable i and variable j, so in your case it could be between Math and Science. If there is no linear dependency, the value is 0.
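
To make the meaning of the cells concrete, here is a small sketch on made-up data. The way science_marks is built from math_marks is purely an assumption for illustration, not a recommended model of grades.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

math_marks = rng.normal(60, 15, size=n)
# Science depends on Math: the Math mark plus a little independent noise.
science_marks = math_marks + rng.normal(0, 5, size=n)
# Art is drawn independently of Math.
art_marks = rng.normal(60, 15, size=n)

# np.cov expects one variable per row.
cov = np.cov(np.vstack([math_marks, science_marks, art_marks]))

print(cov.round(1))
# cov[0, 1] (Math, Science) is large and positive, near 15**2 = 225.
# cov[0, 2] (Math, Art) is close to 0 because the two are independent.
```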

What you did implicitly assumes that the covariance matrix is diagonal, with the same value everywhere on the diagonal. So what you have to do is define your covariance matrix and then draw the samples from a Gaussian with numpy.random.multivariate_normal https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.multivariate_normal.html or from any other distribution function.
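
A minimal sketch of that suggestion, using the multivariate_normal method of NumPy's Generator (the newer counterpart of the function linked above). The correlation values, seed, sample size and variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

subjects = ["Math", "Science", "History", "Art"]
mean = [60.0] * 4
rng = np.random.default_rng(42)

# Diagonal covariance: variance 15**2 for each subject and no covariances.
# This matches the distribution of 15 * np.random.randn(n, 4) + 60.
cov_independent = np.diag([15.0 ** 2] * 4)

# Assumed correlations: Science strongly tied to Math, History and Art less so.
corr = np.array([[1.0, 0.8, 0.7, 0.6],
                 [0.8, 1.0, 0.5, 0.5],
                 [0.7, 0.5, 1.0, 0.5],
                 [0.6, 0.5, 0.5, 1.0]])
cov_correlated = corr * 15.0 ** 2  # same standard deviation of 15 everywhere

independent = pd.DataFrame(
    rng.multivariate_normal(mean, cov_independent, size=2000), columns=subjects)
correlated = pd.DataFrame(
    rng.multivariate_normal(mean, cov_correlated, size=2000), columns=subjects)

print(independent.corr().round(2))  # off-diagonal values near 0
print(correlated.corr().round(2))   # off-diagonal values near the assumed ones
```

The only difference between the two samples is the off-diagonal part of the covariance matrix, which is what carries the correlation between subjects.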

Nathan answered Oct 14 '22 21:10