According to this question and the NVIDIA CUDA Programming Guide it references, the realloc function is not implemented in device code:
The CUDA in-kernel malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory or NULL if insufficient memory exists to fulfill the request. The returned pointer is guaranteed to be aligned to a 16-byte boundary.
The CUDA in-kernel free() function deallocates the memory pointed to by ptr, which must have been returned by a previous call to malloc(). If ptr is NULL, the call to free() is ignored. Repeated calls to free() with the same ptr has undefined behavior.
I am currently stuck on a portion of the GMP library (or, more strictly, on my attempt to port it to CUDA) that relies on this functionality:
__host__ __device__ static void * // generate this function for both CPU and GPU
gmp_default_realloc (void *old, size_t old_size, size_t new_size)
{
    mp_ptr p;
#if __CUDA_ARCH__ // this directive separates device and host code
    /* ? */
#else
    p = (mp_ptr) realloc (old, new_size); /* host code has realloc from glibc */
#endif
    if (!p)
        gmp_die("gmp_default_realloc: Virtual memory exhausted.");
    return p;
}
In principle I can simply call malloc with new_size, then call memcpy (or maybe memmove), then free the previous block, but this always moves the data (large arrays), which I would like to avoid.
Is there an efficient way to implement the (standard C or C++) realloc function inside a kernel? Say I have a large array of dynamically allocated data (already allocated by malloc), and somewhere else realloc is invoked to request a larger amount of memory for that block. In short, I would like to avoid copying the whole array to a new location, and I am asking specifically how to do that (if it is possible at all).
I am not especially familiar with the PTX ISA or the underlying implementation of the in-kernel heap functions, but maybe it is worth looking in that direction?
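For reference, the copy-based fallback described above can be sketched as follows. This is a minimal host-compilable sketch, not GMP's or CUDA's implementation: realloc_by_copy is a hypothetical name, and it assumes the caller knows old_size (which GMP's reallocation hook does pass in). Since the new block is distinct from the old one, memcpy should be enough and memmove is not needed.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: realloc emulated with malloc + memcpy + free.
   Only calls that are also available in CUDA device code are used. */
static void *realloc_by_copy(void *old, size_t old_size, size_t new_size)
{
    void *p = malloc(new_size);
    if (p == NULL)
        return NULL;                 /* old block stays valid, as with realloc */
    if (old != NULL) {
        /* Copy only what fits when the block shrinks. */
        size_t n = old_size < new_size ? old_size : new_size;
        memcpy(p, old, n);           /* blocks never overlap */
        free(old);
    }
    return p;
}
```

In the function from the question, the device branch would then read something like `p = (mp_ptr) realloc_by_copy (old, old_size, new_size);`, at the cost of the copy the question hopes to avoid.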
Most malloc implementations over-allocate, which is why realloc can sometimes avoid copying bytes: the old block may already be large enough for the new size. Apparently the system malloc in your environment does not expose that, so I think your only option is to reimplement all three primitives, gmp_default_{alloc,realloc,free}, on top of the system-provided malloc/free.
There are many open-source malloc implementations out there; glibc has one that you might be able to adapt.
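A rough sketch of what "reimplementing the three primitives on top of malloc/free" could look like: each block carries a small header recording its reserved capacity, and capacities are rounded up to a power of two, so a later realloc that still fits returns the same pointer without copying. All names here (block_header, round_up_pow2, my_alloc, my_realloc, my_free) are hypothetical, and the 16-byte header is an assumption chosen to preserve the 16-byte alignment that the underlying malloc provides.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Header placed in front of every payload; padded to 16 bytes so the
   payload keeps the alignment of the underlying malloc result. */
typedef union {
    size_t capacity;                 /* payload bytes actually reserved */
    unsigned char pad[16];
} block_header;

static size_t round_up_pow2(size_t n)
{
    size_t c = 16;
    while (c < n) {
        if (c > SIZE_MAX / 2) return n;   /* cannot double without overflow */
        c <<= 1;
    }
    return c;
}

static void *my_alloc(size_t size)
{
    size_t cap = round_up_pow2(size);
    if (cap > SIZE_MAX - sizeof(block_header)) return NULL;
    block_header *h = (block_header *)malloc(sizeof(block_header) + cap);
    if (h == NULL) return NULL;
    h->capacity = cap;
    return h + 1;                    /* payload starts right after the header */
}

static void my_free(void *p)
{
    if (p != NULL) free((block_header *)p - 1);
}

static void *my_realloc(void *old, size_t new_size)
{
    if (old == NULL) return my_alloc(new_size);
    block_header *h = (block_header *)old - 1;
    if (new_size <= h->capacity) return old;   /* fits in the slack: no copy */
    void *p = my_alloc(new_size);
    if (p == NULL) return NULL;
    memcpy(p, old, h->capacity);     /* old capacity is smaller than the new one */
    my_free(old);
    return p;
}
```

With power-of-two growth, a buffer that is enlarged step by step is copied only when it crosses a capacity boundary, so the amortized copying cost stays proportional to the final size. The price is up to roughly 2x memory overhead per block.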
I'm not familiar with CUDA or GMP, but off the top of my head:
Mixing gmp_malloc() with a plain free() probably works on "normal" platforms, but will likely cause heap corruption once you go ahead with a custom allocator, so every block must be released through the matching gmp_free().
If all you want is a more efficient realloc, you can simply over-allocate in your custom malloc (up to some size, say the next power of 2), so that a subsequent realloc can avoid copying. You don't even need a full-blown heap implementation for that.
Your implementation may need a mutex or some similar mechanism to protect your heap against concurrent modifications.
You can improve performance even more if you never (or only infrequently) return the malloc()ed blocks to the system from within your custom heap, i.e. keep the gmp_free()ed blocks around for subsequent reuse instead of calling the system free() on them immediately.
Come to think of it, a better idea would be to introduce a sane malloc implementation on that platform, outside of your GMP library, so that other programs and libraries can draw their memory from the same pool, instead of GMP doing one thing and everything else doing something else. This should help with overall memory consumption with respect to the previous point. Maybe you should port glibc first :)
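The suggestion to keep freed blocks around for reuse could be sketched like this. It is a single-threaded illustration with hypothetical names (node, free_lists, size_class, pool_alloc, pool_free); blocks are grouped into power-of-two size classes and a freed block is pushed onto its class list instead of being returned to the system. It has no locking, so in a kernel where many threads allocate concurrently the list updates would need a mutex or atomic operations, as noted above.

```c
#include <stddef.h>
#include <stdlib.h>

#define NUM_CLASSES 24            /* class k holds payloads of 16 << k bytes */

/* Header in front of each payload, padded to 16 bytes to keep alignment. */
typedef union node {
    struct { union node *next; unsigned cls; } meta;
    unsigned char pad[16];
} node;

static node *free_lists[NUM_CLASSES];   /* one cache list per size class */

/* Smallest k with (16 << k) >= n, or -1 if n is too large for this sketch. */
static int size_class(size_t n)
{
    size_t c = 16;
    for (int k = 0; k < NUM_CLASSES; ++k, c <<= 1)
        if (c >= n) return k;
    return -1;
}

static void *pool_alloc(size_t size)
{
    int k = size_class(size);
    if (k < 0) return NULL;
    node *b = free_lists[k];
    if (b != NULL) {
        free_lists[k] = b->meta.next;        /* reuse a cached block */
    } else {
        b = (node *)malloc(sizeof(node) + ((size_t)16 << k));
        if (b == NULL) return NULL;
    }
    b->meta.cls = (unsigned)k;
    return b + 1;
}

static void pool_free(void *p)
{
    if (p == NULL) return;
    node *b = (node *)p - 1;
    b->meta.next = free_lists[b->meta.cls];  /* cache instead of calling free() */
    free_lists[b->meta.cls] = b;
}
```

Because cached blocks are never handed back, memory use only grows; a real implementation would probably cap the cache size or release blocks at a convenient point.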