
Compute the mean and the covariance of a large matrix (300000 x 70000)

I am using NumPy and trying to compute the mean and the covariance of a large matrix (300000 x 70000). I have 32 GB of memory available. What's the best practice for this task in terms of computational efficiency and ease of implementation?

My current implementation is as follows:

import numpy as np

def compute_mean_variance(mat, chunk_size):
    # `mat` is an out-of-core matrix; `mat[...].read().val` loads a chunk
    # into memory as a numpy array.
    row_count = mat.row_count
    col_count = mat.col_count
    # maintain the `x_sum`, `x2_sum` arrays
    # mean(x) = x_sum / row_count
    # var(x) = x2_sum / row_count - mean(x)**2
    x_sum = np.zeros([1, col_count])
    x2_sum = np.zeros([1, col_count])

    for i in range(0, row_count, chunk_size):
        sub_mat = mat[i:i+chunk_size, :]
        # in-memory sub_mat of size chunk_size x col_count
        sub_mat = sub_mat.read().val
        x_sum += np.sum(sub_mat, 0)
        x2_sum += np.sum(sub_mat**2, 0)
    x_mean = x_sum / row_count
    x_var = x2_sum / row_count - x_mean ** 2
    return x_mean, x_var

Any suggestions for improvements?

I find the following implementation more understandable. It also uses numpy to calculate the mean and standard deviation for chunks of columns, so it should be more efficient and numerically stable.

def compute_mean_std(mat, chunk_size):
    row_count = mat.row_count
    col_count = mat.col_count
    mean = np.zeros(col_count)
    std = np.zeros(col_count)

    # chunk_size should be chosen so that row_count * chunk_size values
    # fit comfortably in memory.
    for i in range(0, col_count, chunk_size):
        sub_mat = mat[:, i : i + chunk_size]
        # in-memory sub_mat of size row_count x chunk_size
        sub_mat = sub_mat.read().val
        mean[i : i + chunk_size] = np.mean(sub_mat, axis=0)
        std[i : i + chunk_size] = np.std(sub_mat, axis=0)

    return mean, std


1 Answer

I assume that in order to calculate the variance, you are using what Wikipedia calls the naïve algorithm. However, as noted there:

Because x2_sum / row_count and x_mean ** 2 can be very similar numbers, cancellation can lead to the precision of the result being much less than the inherent precision of the floating-point arithmetic used to perform the computation. Thus this algorithm should not be used in practice. This is particularly bad if the standard deviation is small relative to the mean.
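
A small illustration of the effect (the numbers are arbitrary, chosen only so that the mean is large relative to the spread):

import numpy as np

# A column with a large mean (~1e4) and a small spread (~0.1), in float32.
rng = np.random.default_rng(0)
x = (1e4 + 0.1 * rng.standard_normal(100000)).astype(np.float32)

naive_var = np.mean(x ** 2) - np.mean(x) ** 2   # x2_sum / n - mean ** 2
print(naive_var)   # typically far from the true value, possibly even negative
print(np.var(x))   # numpy's result, close to the true variance of ~0.01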

As an alternative, you might use the two-pass algorithm, i.e., to first calculate the mean and then use it in the calculation of the variance. In principle, it would seem that this is wasteful since one has to iterate twice over the data. However, the "mean" used in the calculation of the variance does not need to be the true mean, a reasonable estimate (perhaps calculated only from the first chunk) would suffice. This would then reduce to the method of assumed mean.
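
A sketch of that assumed-mean variant, reusing the hypothetical mat interface from the question (row_count, col_count, .read().val) and taking the shift from the first chunk:

import numpy as np

def compute_mean_var_shifted(mat, chunk_size):
    row_count = mat.row_count
    col_count = mat.col_count

    # Rough per-column mean estimate from the first chunk only.
    shift = np.mean(mat[0:chunk_size, :].read().val, axis=0)

    d_sum = np.zeros(col_count)    # sum of (x - shift)
    d2_sum = np.zeros(col_count)   # sum of (x - shift)**2

    for i in range(0, row_count, chunk_size):
        sub_mat = mat[i:i + chunk_size, :].read().val
        d = sub_mat - shift
        d_sum += d.sum(axis=0)
        d2_sum += (d ** 2).sum(axis=0)

    mean = shift + d_sum / row_count
    # Both terms below are of comparable, small magnitude, so the
    # subtraction no longer cancels catastrophically.
    var = (d2_sum - d_sum ** 2 / row_count) / row_count
    return mean, var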

Another possibility would be to delegate the calculation of the mean/variance of each chunk directly to numpy and then combine the per-chunk results into the overall mean/variance using the parallel algorithm.
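
A sketch of that combination step (again assuming the question's mat interface). Here M2 is the per-column sum of squared deviations from the mean, so var = M2 / n, and the merge follows the parallel-algorithm update formulas:

import numpy as np

def combine(n_a, mean_a, M2_a, n_b, mean_b, M2_b):
    # Merge the statistics of two disjoint sets of rows.
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    M2 = M2_a + M2_b + delta ** 2 * n_a * n_b / n
    return n, mean, M2

def compute_mean_var_chunked(mat, chunk_size):
    col_count = mat.col_count
    n, mean, M2 = 0, np.zeros(col_count), np.zeros(col_count)

    for i in range(0, mat.row_count, chunk_size):
        sub_mat = mat[i:i + chunk_size, :].read().val
        n_b = sub_mat.shape[0]
        # Let numpy compute the per-chunk statistics...
        mean_b = sub_mat.mean(axis=0)
        M2_b = sub_mat.var(axis=0) * n_b
        # ...then fold them into the running totals.
        n, mean, M2 = combine(n, mean, M2, n_b, mean_b, M2_b)

    return mean, M2 / n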



