Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to translate an R wrapper around a C++ function to Python/Numpy

Tags:

c++

python

r

numpy

The R package Ckmeans.1d.dp relies on C++ code to do 99% of its work.

I want to use this functionality in Python without having to rely on RPy2. Therefore I want to "translate" the R wrapper to an analogous Python wrapper that operates on Numpy arrays the way the R code operates on R vectors. Is this possible? It seems like it should be, since the C++ code itself looks (to my untrained eye) like it stands up on its own.

However, the documentation for Cython doesn't really cover this use case, of wrapping a existing C++ with Python. It's briefly mentioned here and here, but I'm in way over my head since I've never worked with C++ before.

Here's my attempt, which fails with a slew of "Cannot assign type 'double' to 'double *' errors:

Directory structure

.
├── Ckmeans.1d.dp  # clone of https://github.com/cran/Ckmeans.1d.dp
├── ckmeans
│   ├── __init__.py
│   └── _ckmeans.pyx
├── setup.py
└── src
    └── Ckmeans.1d.dp_pymain.cpp

src/Ckmeans.1d.dp_pymain.cpp

#include "../Ckmeans.1d.dp/src/Ckmeans.1d.dp.h"
static void Ckmeans_1d_dp(double *x, int* length, double *y, int * ylength,
                          int* minK, int *maxK, int* cluster,
                          double* centers, double* withinss, int* size)
{
    // Call C++ version one-dimensional clustering algorithm*/
    if(*ylength != *length) { y = 0; }

    kmeans_1d_dp(x, (size_t)*length, y, (size_t)(*minK), (size_t)(*maxK),
                    cluster, centers, withinss, size);

    // Change the cluster numbering from 0-based to 1-based
    for(size_t i=0; i< *length; ++i) {
        cluster[i] ++;
    }
}

ckmeans/init.py

from ._ckmeans import ckmeans

ckmeans/_ckmeans.pyx

cimport numpy as np
import numpy as np
from .ckmeans import ClusterResult

cdef extern from "../src/Ckmeans.1d.dp_pymain.cpp":
    void Ckmeans_1d_dp(double *x, int* length,
                       double *y, int * ylength,
                       int* minK, int *maxK,
                       int* cluster, double* centers, double* withinss, int* size)

def ckmeans(np.ndarray[np.double_t, ndim=1] x, int* min_k, int* max_k):
    cdef int n_x = len(x)
    cdef double y = np.repeat(1, N)
    cdef int n_y = len(y)
    cdef double cluster
    cdef double centers
    cdef double within_ss
    cdef int sizes
    Ckmeans_1d_dp(x, n_x, y, n_y, min_k, max_k, cluster, centers, within_ss, sizes)
    return (np.array(cluster), np.array(centers), np.array(within_ss), np.array(sizes))
like image 851
shadowtalker Avatar asked Nov 09 '22 13:11

shadowtalker


1 Answers

The cdef extern part is correct. The problem (as pointed out by Mihai Todor in the comments in 2016) is that I was not passing pointers into the Ckmeans_1d_dp function.

Cython uses the same "address-of" & syntax as C for getting a pointer, e.g. &x is a pointer to x.

In order to get a pointer to a Numpy array, you should take the address of the first element of the array, as in &x[0] for the array x. It is important to ensure that arrays are contiguous in memory (sequential elements have sequential addresses), because this is how arrays are laid out in C and C++; iterating over an array amounts to incrementing a pointer.

The working definition of ckmeans() in _ckmeans.pyx looked something like this:

def ckmeans(
    np.ndarray[np.float64_t] x,
    int min_k,
    int max_k,
    np.ndarray[np.float64_t] weights
):
    # Ensure input arrays are contiguous; if the input data is not
    # already contiguous and in C order, this might make a copy!
    x = np.ascontiguousarray(x, dtype=np.dtype('d'))
    y = np.ascontiguousarray(weights, dtype=np.dtype('d'))

    cdef int n_x = len(x)
    cdef int n_weights = len(weights)

    # Ouput: cluster membership for each element
    cdef np.ndarray[int, ndim=1] clustering = np.ascontiguousarray(np.empty((n_x,), dtype=ctypes.c_int))

    # Outputs: results for each cluster
    # Pre-allocate these for max k, then truncate later
    cdef np.ndarray[np.double_t, ndim=1] centers = np.ascontiguousarray(np.empty((max_k,), dtype=np.dtype('d')))
    cdef np.ndarray[np.double_t, ndim=1] within_ss = np.ascontiguousarray(np.zeros((max_k,), dtype=np.dtype('d')))
    cdef np.ndarray[int, ndim=1] sizes = np.ascontiguousarray(np.zeros((max_k,), dtype=ctypes.c_int))

    # Outputs: overall clustering stats
    cdef double total_ss = 0
    cdef double between_ss = 0

    # Call the 'cdef extern' function
    _ckmeans.Ckmeans_1d_dp(
        &x[0],
        &n_x,
        &weights[0],
        &n_weights,
        &min_k,
        &max_k,
        &clustering[0],
        &centers[0],
        &within_ss[0],
        &sizes[0],
    )

    # Calculate overall clustering stats
    if n_x == n_weights and y.sum() != 0:
        total_ss = np.sum(y * (x - np.sum(x * weights) / weights.sum()) ** 2)
    else:
        total_ss = np.sum((x - x.sum() / n_x) ** 2)
    between_ss = total_ss - within_ss.sum()

    # Extract final the number of clusters from the results.
    # We initialized sizes as a vector of 0's, and cluster size can never be
    # zero, so we know that any 0 size element is an empty/unused cluster.
    cdef int k = np.sum(sizes > 0)

    # Truncate output arrays to remove unused clusters
    centers = centers[:k]
    within_ss = within_ss[:k]
    sizes = sizes[:k]

    # Change the clustering back to 0-indexed, because
    # the R wrapper changes it to 1-indexed.
    return (
        clustering - 1,
        k,
        centers,
        sizes,
        within_ss,
        total_ss,
        between_ss
    )

Note that this particular R package now has a Python wrapper: https://github.com/djdt/ckwrap.

like image 77
shadowtalker Avatar answered Nov 14 '22 21:11

shadowtalker