generating correlated numbers in numpy / pandas

I’m trying to generate simulated student grades in 4 subjects, where a student record is a single row of data. The code shown here will generate normally distributed random numbers with a mean of 60 and a standard deviation of 15.

df = pd.DataFrame(15 * np.random.randn(5, 4) + 60, columns=['Math', 'Science', 'History', 'Art'])

What I can’t figure out is how to make a student’s Science mark highly correlated with their Math mark, while their History and Art marks are less strongly, but still somewhat, correlated with the Math mark.

I’m neither a statistician nor an expert programmer, so a less sophisticated but more easily understood solution is what I’m hoping for.

614

asked Aug 30 '17 05:08

Soundguy

2 Answers

Let's put what has been suggested by @Daniel into code.

Step 1

Let's import multivariate_normal, along with NumPy and pandas:

import numpy as np
import pandas as pd
from scipy.stats import multivariate_normal as mvn

Step 2

Let's construct the covariance matrix:

cov = np.array([[1, .8, .7, .6], [.8, 1, .5, .5], [.7, .5, 1, .5], [.6, .5, .5, 1]])
cov

array([[ 1. ,  0.8,  0.7,  0.6],
       [ 0.8,  1. ,  0.5,  0.5],
       [ 0.7,  0.5,  1. ,  0.5],
       [ 0.6,  0.5,  0.5,  1. ]])

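This is the key step. Note that the covariance matrix has 1's on the diagonal, so every subject has unit variance (a standard deviation of 1, not the 15 used in the question) and the off-diagonal entries can be read directly as correlations. The correlations with Math (the first row) decrease as you step from left to right.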

Now we are ready to generate data, let's say 1,000 points:

scores = mvn.rvs(mean = [60.,60.,60.,60.], cov=cov, size = 1000)
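
With unit variances the simulated marks stay within a few points of 60. One way to recover the question's spread, sketched here on the assumption that every subject shares a standard deviation of 15, is to treat the matrix as a correlation matrix and scale it into a covariance matrix. The names corr, std, cov_scaled and scores_scaled, and the seed, are illustrative choices and not part of the original answer.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Treat the matrix from the answer as a correlation matrix.
corr = np.array([[1.0, 0.8, 0.7, 0.6],
                 [0.8, 1.0, 0.5, 0.5],
                 [0.7, 0.5, 1.0, 0.5],
                 [0.6, 0.5, 0.5, 1.0]])

# Assumed standard deviation of 15 for every subject (taken from the question).
std = np.full(4, 15.0)

# cov[i, j] = corr[i, j] * std[i] * std[j]
cov_scaled = corr * np.outer(std, std)

# random_state fixes the seed so the draw is repeatable (arbitrary value).
scores_scaled = mvn.rvs(mean=[60.0] * 4, cov=cov_scaled, size=1000, random_state=0)

print(scores_scaled.std(axis=0))             # each value should be close to 15
print(np.corrcoef(scores_scaled.T).round(2)) # should be close to corr
```

Scaling changes only the spread of each subject; the correlations between subjects are unchanged.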

Sanity check (from covariance matrix to simple correlations):

np.corrcoef(scores.T)

array([[ 1.        ,  0.78886583,  0.70198586,  0.56810058],
       [ 0.78886583,  1.        ,  0.49187904,  0.45994833],
       [ 0.70198586,  0.49187904,  1.        ,  0.4755558 ],
       [ 0.56810058,  0.45994833,  0.4755558 ,  1.        ]])

Note that np.corrcoef by default expects each variable in a row (rowvar=True), which is why scores is transposed.
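
As a small illustration of that point, the sketch below uses made-up data (the array name data and the seed are hypothetical) to show that transposing and passing rowvar=False give the same result, and what happens if neither is done.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed
data = rng.normal(size=(1000, 4))     # 1000 observations (rows) of 4 variables (columns)

by_transpose = np.corrcoef(data.T)            # move the variables into rows
by_rowvar = np.corrcoef(data, rowvar=False)   # or tell NumPy that columns are variables

print(by_transpose.shape)                     # (4, 4)
print(np.allclose(by_transpose, by_rowvar))   # True
print(np.corrcoef(data).shape)                # (1000, 1000): each row treated as a variable
```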

Finally, let's put the data into a pandas DataFrame:

df = pd.DataFrame(data = scores, columns = ["Math", "Science","History", "Art"])
df.head()

    Math        Science     History     Art
0   60.629673   61.238697   61.805788   61.848049
1   59.728172   60.095608   61.139197   61.610891
2   61.205913   60.812307   60.822623   59.497453
3   60.581532   62.163044   59.277956   60.992206
4   61.408262   59.894078   61.154003   61.730079
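
Once the data are in a DataFrame, the same sanity check can be run with pandas, which treats columns as variables, so no transpose is needed. This is only a sketch; it regenerates the data so that it runs on its own, and the seed is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from scipy.stats import multivariate_normal as mvn

cov = np.array([[1.0, 0.8, 0.7, 0.6],
                [0.8, 1.0, 0.5, 0.5],
                [0.7, 0.5, 1.0, 0.5],
                [0.6, 0.5, 0.5, 1.0]])
scores = mvn.rvs(mean=[60.0] * 4, cov=cov, size=1000, random_state=0)
df = pd.DataFrame(scores, columns=["Math", "Science", "History", "Art"])

# Pairwise Pearson correlations between the columns.
print(df.corr().round(2))

# Correlation of every subject with Math, strongest first.
corr_with_math = df.corr()["Math"].sort_values(ascending=False)
print(corr_with_math)
```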

Step 3

Let's visualize some data that we've just generated:

ax = df.plot(x = "Math",y="Art", kind="scatter", color = "r", alpha = .5, label = "Art, $corr_{Math}$ = .6")
df.plot(x = "Math",y="Science", kind="scatter", ax = ax, color = "b", alpha = .2, label = "Science, $corr_{Math}$ = .8")
ax.set_ylabel("Art and Science");

[Figure: scatter plot of Art (red) and Science (blue) marks against Math marks]

190

answered Oct 14 '22 22:10

Sergey Bushmanov

The statistical tool for that is the covariance matrix: https://en.wikipedia.org/wiki/Covariance. Each cell (i, j) represents the dependency between variable i and variable j, so in your case it could be the dependency between Math and Science. If the two variables are uncorrelated, the value is 0.
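
To make the cell-by-cell reading concrete, here is a small sketch with made-up marks for three students (the numbers are purely illustrative). np.cov expects one variable per row by default.

```python
import numpy as np

# Hypothetical marks for 3 students: Science moves with Math, Art does not.
math_marks    = np.array([50.0, 60.0, 70.0])
science_marks = np.array([55.0, 65.0, 75.0])
art_marks     = np.array([60.0, 40.0, 60.0])

# Stack the subjects as rows, so cell (i, j) relates subject i to subject j.
cov = np.cov(np.vstack([math_marks, science_marks, art_marks]))

print(cov[0, 0])   # variance of Math
print(cov[0, 1])   # covariance of Math and Science: positive, they rise together
print(cov[0, 2])   # covariance of Math and Art: zero for these numbers
```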

Your code implicitly assumes that the covariance matrix is diagonal, with the same value in every diagonal cell, which makes the four subjects independent. What you have to do instead is define your own covariance matrix and then draw the samples from a Gaussian with numpy.random.multivariate_normal https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.multivariate_normal.html or from any other multivariate distribution function.
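
A minimal sketch of that suggestion using NumPy alone. The correlation values below are made-up choices meant to mimic "highly" and "somewhat" correlated subjects, and the shared standard deviation of 15 comes from the question. The Generator method rng.multivariate_normal is the newer counterpart of numpy.random.multivariate_normal.

```python
import numpy as np
import pandas as pd

subjects = ["Math", "Science", "History", "Art"]

# Assumed target correlations (illustrative values, not from the answers).
corr = np.array([[1.0, 0.8, 0.5, 0.4],
                 [0.8, 1.0, 0.4, 0.3],
                 [0.5, 0.4, 1.0, 0.3],
                 [0.4, 0.3, 0.3, 1.0]])

std = 15.0             # same standard deviation for every subject
cov = corr * std**2    # covariance = correlation * std_i * std_j

rng = np.random.default_rng(42)   # arbitrary seed
marks = rng.multivariate_normal(mean=[60.0] * 4, cov=cov, size=1000)

df = pd.DataFrame(marks, columns=subjects)
print(df.corr().round(2))      # should be close to corr
print(df.std().round(1))       # each should be close to 15
```

Setting every off-diagonal entry of corr to 0 gives independent columns again, which is the situation in the question's original code.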

answered Oct 14 '22 21:10

Nathan

Tags: python, pandas, numpy, statistics, correlation
