Problem
I want to implement an algorithm from an unpublished paper by my supervisor, and as part of that I need to construct a covariance matrix C using some rules given in the paper. I'm coming from Matlab and wanted to take this opportunity to finally learn Python, hence my question: how do I do this in the most efficient (fast) way in Python (including NumPy/SciPy)?
Subproblem 1:
Is there any better way?
Subproblem 2:
I'm fairly convinced subproblem 1 is a no-brainer, but I don't know about subproblem 2. I should also say that the matrix I'm dealing with will be roughly 2*10^4 x 2*10^4.
Thanks!
Edit: I prefer not to give the actual covariance matrix, but since people want an example, let's say we want to construct the covariance matrix of a stochastic process called a 'Brownian bridge'. Its structure is given by:
cov(Xs, Xt) = min{s,t} − st
for let's say s,t ∈ {1,...,100}. How would you build it?
Covariance is a measure of the strength of the correlation between two or more sets of variables. The covariance matrix element Cij is the covariance of xi and xj, and the diagonal element Cii is the variance of xi. If cov(xi, xj) = 0, the variables are uncorrelated; if cov(xi, xj) > 0, they are positively correlated.
A covariance matrix can be estimated from data with the NumPy function np.cov(). For example, given two data sets x and y, np.cov(x, y) returns a 2x2 array whose entries [0, 1] and [1, 0] are the covariance of x and y.
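As a quick illustration (the data values here are made up):

```python
import numpy as np

# Two small, made-up data sets; y is exactly 2*x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# np.cov returns the 2x2 sample covariance matrix
# (normalized by N-1 by default)
C = np.cov(x, y)

# C[0, 0] is var(x), C[1, 1] is var(y),
# and C[0, 1] == C[1, 0] is cov(x, y)
```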
First off, for others who may come across this question in the future: if you do have data and want to estimate a covariance matrix from it, then, as several people have noted, use np.cov or something similar.
However, your question is different. To clear up some confusion in the comments: your question isn't about estimating a covariance matrix, it's about specifying one. In other words, you're asking how to build up a large array given some pre-defined rules.
Which way is most efficient will depend on exactly what you're doing. Most performance tricks in this case involve exploiting symmetry in the calculation you're performing. (For example, are some rows identical?)
It's hard to say anything specific without knowing exactly what you're doing. Therefore, I'll focus on how to do this type of thing in general. (Note: I just noticed your edit. I'll include an example for a Brownian Bridge in just a bit...)
The most basic case is a constant row or column in the output array. It's easy to create the array and assign values to a column or row using slicing syntax:
import numpy as np
num_vars = 10**4
cov = np.zeros((num_vars, num_vars), dtype=float)
To set an entire column/row:
# Third column will be all 9's
cov[:,2] = 9
# Second row will be all 1's (will overwrite the 9 in col3)
cov[1,:] = 1
You can also assign arrays to columns/rows:
# 5th row will have random values
cov[4,:] = np.random.random(num_vars)
# 6th row will have a simple quadratic sequence
cov[5,:] = np.arange(num_vars)**2
In many cases (but probably not this exact case) you'll want to build up your output from existing arrays. You can use vstack/hstack/column_stack/tile and many other similar functions for this.
A good example is if we're setting up a matrix for a linear inversion of a polynomial:
import numpy as np
num = 10
x = np.random.random(num) # Observation locations
# "Green's functions" for a second-order polynomial
# at our observed locations
A = np.column_stack([x**i for i in range(3)])
However, this will build up several temporary arrays (three, in this case). If we were working with a 10000-dimensional polynomial with 10^6 observations, the approach above would use too much RAM. Therefore, you might iterate over columns instead:
ndim = 2
A = np.zeros((x.size, ndim + 1), dtype=float)
for j in range(ndim + 1):
    A[:, j] = x**j
In most cases, don't worry about the temporary arrays. The column_stack-based example is the right way to go unless you're working with relatively large arrays.
Without any more information, we can't exploit any sort of symmetry. The most general way is to just iterate through. Typically you'll want to avoid this approach, but sometimes it's unavoidable (especially if the calculation depends on a previous value).
Speed-wise this is identical to nested for loops, but it's easier (especially for >2D arrays) to use np.ndindex instead of multiple for loops:
import numpy as np
num_vars = 10**4
cov = np.zeros((num_vars, num_vars), dtype=float)
for i, j in np.ndindex(cov.shape):
    # Logic presumably in some function...
    cov[i, j] = calculate_value(i, j)
In many cases, you can vectorize index-based calculations. In other words, operate directly on arrays of the indices of your output.
Let's say we had code that looked like:
import numpy as np
cov = np.zeros((10, 10), dtype=float)
for i, j in np.ndindex(cov.shape):
    cov[i, j] = i*j - i
We could replace that with:
i, j = np.mgrid[:10, :10]
cov = i*j - i
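As a sanity check, the loop version and the vectorized version produce identical values. This just combines the two snippets above into one self-contained comparison:

```python
import numpy as np

# Loop-based construction
cov_loop = np.zeros((10, 10), dtype=float)
for i, j in np.ndindex(cov_loop.shape):
    cov_loop[i, j] = i*j - i

# Vectorized construction using index grids
i, j = np.mgrid[:10, :10]
cov_vec = i*j - i

# Same values, element for element
same = np.array_equal(cov_loop, cov_vec)
```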
As another example, let's build up a 100 x 100 "inverted cone" of values:
# The complex numbers in "mgrid" give the number of increments
# mgrid[min:max:num*1j, min:max:num*1j] is similar to
# meshgrid(linspace(min, max, num), linspace(min, max, num))
y, x = np.mgrid[-5:5:100j, -5:5:100j]
# Our "inverted cone" is just the distance from 0
r = np.hypot(x, y)
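Grids like this also come in handy for distance-based rules. As an illustration (the Gaussian fall-off below is just a common choice I picked, not something from your paper):

```python
import numpy as np

# Same 100 x 100 grid of coordinates as above
y, x = np.mgrid[-5:5:100j, -5:5:100j]
r = np.hypot(x, y)

# A hypothetical squared-exponential (Gaussian) fall-off with
# distance from the origin, evaluated over the whole grid at once
C = np.exp(-r**2 / 2)
```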
This is a good example of something that can be easily vectorized. If I'm reading your example correctly, you'd want something similar to:
import numpy as np
st = np.mgrid[1:101, 1:101]
s, t = st
cov = st.min(axis=0) - s * t
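A quick spot-check of that construction against the definition cov(Xs, Xt) = min{s, t} - st:

```python
import numpy as np

st = np.mgrid[1:101, 1:101]
s, t = st
cov = st.min(axis=0) - s * t

# Entry [i, j] corresponds to s = i + 1, t = j + 1
assert cov.shape == (100, 100)
assert cov[0, 0] == 1 - 1 * 1   # min(1, 1) - 1*1 = 0
assert cov[1, 2] == 2 - 2 * 3   # min(2, 3) - 2*3 = -4
```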
Overall, I've only touched on a few general patterns. However, hopefully this gets you pointed in the right direction.