How to create a custom numpy dtype using cython

There are examples for creating custom numpy dtypes using C here:

Additionally, it seems to be possible to create custom ufuncs in cython:

It seems like it should also be possible to create a dtype using cython (and then create custom ufuncs for it). Is it possible? If so, can you post an example?

USE CASE:

I want to do some survival analysis. The basic data elements are survival times (floats) with associated censor values (False if the associated time represents a failure time and True if it instead represents a censoring time (i.e., no failure occurred during the period of observation)).

Obviously I could just use two numpy arrays to store these values: a float array for the times and a bool array for the censor values. However, I want to account for the possibility of an event occurring multiple times (this is a good model for, say, heart attacks - you can have more than one). In this case, I need an array of objects which I call MultiEvents. Each MultiEvent contains a sequence of floats (uncensored failure times) and an observation period (also a float). Note that the number of failures is not the same for all MultiEvents.

I need to be able to perform a few operations on an array of MultiEvents:

1. Get the number of failures for each
2. Get the censored time (that is, the period of observation minus the sum of all failure times)
3. Calculate a log likelihood based on additional arrays of parameters (such as an array of hazard values). For example, the log likelihood for a single MultiEvent M and constant hazard value h would be something like:

sum(log(h) + h*t for t in M.times) - h*(M.period - sum(M.times))

where M.times is the list (array, whatever) of failure times and M.period is the total observation period. I want the proper numpy broadcasting rules to apply, so that I can do:

log_lik = logp(M_vec,h_vec)

and it will work as long as the dimensions of M_vec and h_vec are compatible.

My current implementation uses numpy.vectorize. That works well enough for 1 and 2, but it is too slow for 3. Note also that I can't do this because the number of failures in my MultiEvent objects is not known ahead of time.

asked Nov 02 '12 16:11

jcrudy

1 Answer

NumPy arrays are most suitable for data types with a fixed size. If the objects in the array do not have a fixed size (such as your MultiEvent), operations on them can become much slower.
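As a small illustration of that difference (a sketch added for clarity; the names ragged and flat are just illustrative), variable-length elements force NumPy to fall back to dtype=object, where each element is a reference to a Python object, while fixed-size values live in one contiguous typed buffer:

```python
import numpy as np

# Ragged data: each element is a Python list of a different length,
# so NumPy can only store references to the objects (dtype=object).
ragged = np.empty(2, dtype=object)
ragged[0] = [0.4, 0.6]
ragged[1] = [2.6]

# Fixed-size data: every element occupies the same number of bytes,
# so operations can run as compiled loops over a contiguous buffer.
flat = np.array([0.4, 0.6, 2.6])

print(ragged.dtype)  # object
print(flat.dtype)    # float64
```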

I would recommend storing all of the survival times in a flat 1-D record array with 3 fields: event_id, time, period. Each event can appear multiple times in the array, once per failure time:

>>> import numpy as np
>>> rawdata = [(1, 0.4, 4), (1, 0.6, 6), (2, 2.6, 6)]
>>> npdata = np.rec.fromrecords(rawdata, names='event_id,time,period')
>>> print(npdata)
[(1, 0.4, 4) (1, 0.6, 6) (2, 2.6, 6)]

To get the data for a specific event, you can index with a boolean mask (fancy indexing):

>>> eventdata = npdata[npdata.event_id == 1]
>>> print(eventdata)
[(1, 0.4, 4) (1, 0.6, 6)]
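With this flat layout, the first two operations from the question (number of failures and censored time per event) can probably be computed without a Python loop. The following is a minimal sketch rather than a tested recipe: it assumes that every row of an event carries the same period (the sample data is adjusted so that this holds), and the variable names are only illustrative.

```python
import numpy as np

# Hypothetical sample data: (event_id, failure time, observation period).
# Assumption: all rows of one event share the same period.
rawdata = [(1, 0.4, 4.0), (1, 0.6, 4.0), (2, 2.6, 6.0)]
npdata = np.rec.fromrecords(rawdata, names='event_id,time,period')

# ids:     sorted unique event ids
# first:   index of the first row belonging to each id
# inverse: for every row, the position of its id within ids
ids, first, inverse = np.unique(npdata.event_id,
                                return_index=True, return_inverse=True)

# Operation 1: number of failures per event (rows per id).
n_failures = np.bincount(inverse)

# Operation 2: censored time = period - sum of failure times, per event.
total_time = np.bincount(inverse, weights=npdata.time)
censored_time = npdata.period[first] - total_time

print(ids)         # [1 2]
print(n_failures)  # [2 1]
```

Note that an event with no failures at all has no rows in this layout, so such events would need to be tracked separately (for example in a second array of periods).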

The advantage of this approach is that you can easily integrate it with your ndarray-based functions. You can also access these arrays from Cython, as described in the manual:

cimport numpy as np
import numpy as np

# Packed C struct whose fields mirror the record dtype used below
cdef packed struct Event:
    np.int32_t event_id
    np.float64_t time
    np.float64_t period

def f():
    cdef np.ndarray[Event] b = np.zeros(10,
        dtype=np.dtype([('event_id', np.int32),
                        ('time', np.float64),
                        ('period', np.float64)]))
    <...>
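Before reaching for Cython, it may be worth noting that for a constant hazard the expression in the question depends on each event only through three fixed-size numbers: the number of failures, the sum of the failure times, and the period. The sketch below is an illustration under that assumption, not a definitive implementation. It follows the expression exactly as written in the question (which is itself given as "something like"), so adjust the signs to match your model. The function signature and the per-event arrays are hypothetical.

```python
import numpy as np

def logp(n_failures, total_time, period, h):
    """Question's expression, rewritten in terms of per-event totals.

    sum(log(h) + h*t for t in times) - h*(period - sum(times))
      == n*log(h) + h*S - h*(period - S),  with n = len(times), S = sum(times)
    """
    n_failures, total_time, period, h = map(
        np.asarray, (n_failures, total_time, period, h))
    return n_failures * np.log(h) + h * total_time - h * (period - total_time)

# Hypothetical per-event totals for two events.
n = np.array([2, 1])        # number of failures
S = np.array([1.0, 2.6])    # sum of failure times
P = np.array([4.0, 6.0])    # observation periods

h_vec = np.array([0.5, 1.0, 2.0])

# Ordinary NumPy broadcasting: (2, 1) against (3,) gives shape (2, 3).
log_lik = logp(n[:, None], S[:, None], P[:, None], h_vec)
print(log_lik.shape)  # (2, 3)
```

Because every input is a plain fixed-size array, the usual broadcasting rules apply and no custom dtype or numpy.vectorize is involved. If the hazard varies within an event, this shortcut no longer holds and a per-row computation would be needed.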

answered Sep 24 '22 14:09

btel
