I'm interested in recommended, fast ways of creating cudf DataFrames from dense numpy objects. I have seen many examples that split the columns of a 2D numpy matrix into tuples and then call cudf.DataFrame on the list of tuples, which is rather expensive. Using numba.cuda.to_device is quite fast. Is it possible to use numba.cuda.to_device, or is there a more efficient way of constructing the DataFrame?
In [1]: import cudf
In [2]: import numba.cuda
In [3]: import numpy as np
In [4]: data = np.random.random((300,100))
In [5]: data.nbytes
Out[5]: 240000
In [6]: %time numba.cuda.to_device(data)
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 4.45 ms
Out[6]: <numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f8954f84550>
In [7]: record_data = (('fea%d'%i, data[:,i]) for i in range(data.shape[1]))
In [8]: %time cudf.DataFrame(record_data)
CPU times: user 960 ms, sys: 508 ms, total: 1.47 s
Wall time: 1.61 s
Out[8]: <cudf.DataFrame ncols=100 nrows=300 >
The above shows cudf.DataFrame to be ~360x slower than a direct call to numba.cuda.to_device.
cudf.DataFrame is a dedicated columnar format and performs best with data that is very tall rather than wide. However, we have some important zero-copy functions that let you move data between numba/cupy/cudf inexpensively. At this point in time, as far as I know, the best way to get a raw numpy matrix into cudf is to use the to_device method, as you identified, followed by from_gpu_matrix in cudf.
import cudf
import numba.cuda
import numpy as np
data = np.random.random((300, 100))
%time gpu = numba.cuda.to_device(data)
%time df = cudf.DataFrame.from_gpu_matrix(gpu, columns = ['fea%d'%i for i in range(data.shape[1])])
Out:
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 872 µs
CPU times: user 180 ms, sys: 0 ns, total: 180 ms
Wall time: 186 ms
The 186 ms spent creating the cudf.DataFrame is the minimum creation time; it is overhead, primarily for host-side management of columnar memory and metadata.
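As a quick sanity check on the numbers quoted in this thread, the short script below redoes the arithmetic on the reported wall times. It uses only the timings printed above (4.45 ms, 1.61 s, 872 µs, 186 ms); the variable names are made up for illustration, and the results say nothing about other hardware or other cudf versions.

```python
# Wall-clock timings quoted in this thread, converted to seconds.
to_device_s = 4.45e-3        # numba.cuda.to_device in the question
from_tuples_s = 1.61         # cudf.DataFrame built from (name, column) tuples
to_device_again_s = 872e-6   # numba.cuda.to_device in the answer
from_gpu_matrix_s = 186e-3   # cudf.DataFrame.from_gpu_matrix in the answer

# How much slower the tuple-based constructor was than the raw device copy.
slowdown = from_tuples_s / to_device_s

# Total cost of the two-step path: device copy, then from_gpu_matrix.
two_step_s = to_device_again_s + from_gpu_matrix_s

# How the two-step path compares with the tuple-based constructor.
speedup = from_tuples_s / two_step_s

print(f"tuples vs. to_device alone: {slowdown:.0f}x slower")
print(f"two-step path total: {two_step_s * 1e3:.1f} ms")
print(f"two-step path vs. tuples: {speedup:.1f}x faster")
```

On these figures the tuple-based constructor is about 362x slower than the bare copy, which matches the "~360x" in the question, and the two-step path takes about 187 ms in total, roughly 8.6x faster than building from tuples.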
Please note that the cudf.DataFrame.from_gpu_matrix() method has been deprecated since RAPIDS 0.17. Nowadays, cudf.DataFrame() accepts Numba DeviceNDArrays as input data.
import cudf
from numba import cuda
# Convert a Numba DeviceNDArray to a cuDF DataFrame
src = cuda.to_device([[1, 2], [3, 4]])
dst = cudf.DataFrame(src)
print(type(dst), "\n", dst)
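To connect this back to the original question, here is a minimal sketch of the same idea applied to a NumPy matrix with named columns. The helper names feature_names and numpy_to_cudf are hypothetical, not part of cudf or numba, and the sketch assumes a cudf version in which the constructor accepts a device array as described above. The GPU libraries are imported inside the function so that the helpers can be defined on a machine without a GPU.

```python
def feature_names(n_cols, prefix="fea"):
    """Return illustrative column labels such as fea0, fea1, ..."""
    return [f"{prefix}{i}" for i in range(n_cols)]


def numpy_to_cudf(host_matrix, prefix="fea"):
    """Copy a 2D NumPy array to the GPU and wrap it in a cuDF DataFrame.

    Hypothetical helper: the imports are deferred so that defining it
    does not require cudf or a CUDA device.
    """
    import cudf
    from numba import cuda

    # Host-to-device copy, as in the question.
    device_matrix = cuda.to_device(host_matrix)

    # Pass the device array straight to the constructor, as in the answer.
    df = cudf.DataFrame(device_matrix)

    # Relabel the columns after construction (pandas-style assignment).
    df.columns = feature_names(host_matrix.shape[1], prefix)
    return df
```

On a GPU machine, numpy_to_cudf(np.random.random((300, 100))) would be expected to return a frame with columns fea0 through fea99. No timing is claimed for it here.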