so I am currently trying to do something like A**b for some 2d ndarray and a double b in parallel for Python. I would like to do it with a C extension using OpenMP (yes I know, there is Cython etc. but at some point I always ran into trouble with those 'high-level' approaches...).
So here is the gaussian.c Code for my gaussian.so:
void scale(const double *A, double *out, int n) {
int i, j, ind1, ind2;
double power, denom;
power = 10.0 / M_PI;
denom = sqrt(M_PI);
#pragma omp parallel for
for (i = 0; i < n; i++) {
for (j = i; j < n; j++) {
ind1 = i*n + j;
ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}
(A is a square double Matrix, out has the same shape and n is the number of rows/columns) So the point is to update some symmetric distance matrix - ind2 is the transposed index of ind1.
I compile it using gcc -shared -fopenmp -o gaussian.so -lm gaussian.c. I access the function directly via ctypes in Python:
test = c_gaussian.scale
test.restype = None
test.argtypes = [ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'), # array of sample
ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'), # array of sampl
ctypes.c_int # number of samples
]
The function 'test' is working smoothly as long as I comment the #pragma line - otherwise it ends with error number 139.
A = np.random.rand(1000, 1000) + 2.0
out = np.empty((1000, 1000))
test(A, out, 1000)
When I change the inner loop to just print ind1 and ind2 it runs smoothly in parallel. It also works, when I just access the ind1 location and leave ind2 alone (even in parallel)! Where do I screw up the memory access? How can I fix this?
thank you!
Update: Well I guess this is running into the GIL, but I am not yet sure...
Update: Okay, I am pretty sure now, that it is evil GIL killing me here, so I altered the example:
I now have gil.c:
#include <Python.h>
#define _USE_MATH_DEFINES
#include <math.h>
void scale(const double *A, double *out, int n) {
int i, j, ind1, ind2;
double power, denom;
power = 10.0 / M_PI;
denom = sqrt(M_PI);
Py_BEGIN_ALLOW_THREADS
#pragma omp parallel for
for (i = 0; i < n; i++) {
for (j = i; j < n; j++) {
ind1 = i*n + j;
ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}
Py_END_ALLOW_THREADS
}
which is compiled using gcc -shared -fopenmp -o gil.so -lm gil.c -I /usr/include/python2.7 -L /usr/lib/python2.7/ -lpython2.7 and the corresponding Python file:
import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer
import pylab as pl
path = '../src/gil.so'
c_gil = ctypes.cdll.LoadLibrary(path)
test = c_gil.scale
test.restype = None
test.argtypes = [ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'),
ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'),
ctypes.c_int
]
n = 100
A = np.random.rand(n, n) + 2.0
out = np.empty((n,n))
test(A, out, n)
This gives me
Fatal Python error: PyEval_SaveThread: NULL tstate
Process finished with exit code 134
Now somehow it seems to not be able to save the current thread - but the API doc does not go into detail here, I was hoping that I could ignore Python when writing my C function, but this seems to be quite messy :( any ideas? I found this very helpful: GIL
Your problem is much simpler than you think and does not involve GIL in any way. You are running in an out-of-bound access to out[] when you access it via ind2 since j easily becomes larger than n. The reason is simply that you have not applied any data sharing clause to your parallel region and all variables except i remain shared (as per default in OpenMP) and therefore subject to data races - in that case multiple simultaneous increments being done by the different threads. Having too large j is less of a problem with ind1, but not with ind2 since there the too large value is multiplied by n and thus becomes far too large.
Simply make j, ind1 and ind2 private as they should be:
#pragma omp parallel for private(j,ind1,ind2)
for (i = 0; i < n; i++) {
for (j = i; j < n; j++) {
ind1 = i*n + j;
ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}
Even better, declare them inside the scope where they are being used. That automatically makes them private:
#pragma omp parallel for
for (i = 0; i < n; i++) {
int j;
for (j = i; j < n; j++) {
int ind1 = i*n + j;
int ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With