
Recommended cudf Dataframe Construction

I'm interested in recommended, fast ways of creating cudf DataFrames from dense numpy arrays. I have seen many examples that split the columns of a 2-D numpy matrix into tuples and then call cudf.DataFrame on the list of tuples -- this is rather expensive. Using numba.cuda.to_device is quite fast. Is it possible to build on numba.cuda.to_device, or is there a more efficient way of constructing the DataFrame?

In [1]: import cudf

In [2]: import numba.cuda

In [3]: import numpy as np

In [4]: data = np.random.random((300,100))

In [5]: data.nbytes
Out[5]: 240000

In [6]: %time numba.cuda.to_device(data)
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 4.45 ms
Out[6]: <numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f8954f84550>

In [7]: record_data = (('fea%d'%i, data[:,i]) for i in range(data.shape[1]))

In [8]: %time cudf.DataFrame(record_data)
CPU times: user 960 ms, sys: 508 ms, total: 1.47 s
Wall time: 1.61 s
Out[8]: <cudf.DataFrame ncols=100 nrows=300 >

The above shows cudf.DataFrame being ~360x slower than a direct call to numba.cuda.to_device.

asked by quasiben

2 Answers

cudf.DataFrame is a dedicated columnar format and performs best with data that is very tall rather than wide. However, there are some important zero-copy functions that let you move data between numba, cupy, and cudf inexpensively. At this point in time, as far as I know, the best way to get a raw numpy matrix into cudf is to use numba.cuda.to_device, as you identified, followed by cudf.DataFrame.from_gpu_matrix.

import cudf
import numba.cuda
import numpy as np

data = np.random.random((300, 100))

# Single host-to-device copy, then wrap the device matrix as a DataFrame
%time gpu = numba.cuda.to_device(data)
%time df = cudf.DataFrame.from_gpu_matrix(gpu, columns=['fea%d' % i for i in range(data.shape[1])])

Out:

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 872 µs
CPU times: user 180 ms, sys: 0 ns, total: 180 ms
Wall time: 186 ms

The 186 ms to create the cudf.DataFrame is the minimum creation time; it is overhead primarily from host-side management of columnar memory and metadata.
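The zero-copy interop mentioned above works through __cuda_array_interface__. Below is a minimal sketch, assuming CuPy is installed, of the kind of hand-offs between Numba, CuPy, and cudf that avoid host round trips; exact accessors (such as Series.values) can vary across RAPIDS releases.

import cudf
import cupy as cp
import numba.cuda
import numpy as np

# One host-to-device copy up front
gpu = numba.cuda.to_device(np.random.random(300))

# Numba device array -> CuPy view (zero-copy via __cuda_array_interface__)
cp_arr = cp.asarray(gpu)

# CuPy array -> cudf Series (stays on the device; no round trip through the host)
s = cudf.Series(cp_arr)

# cudf Series -> CuPy array (Series.values returns a CuPy representation)
back = s.values

# CuPy array -> Numba device array view, e.g. for custom kernels
nb_view = numba.cuda.as_cuda_array(back)

Where possible these conversions reinterpret the same device buffer; some constructors may still copy device-to-device, which is far cheaper than going back through host memory.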

answered by Thomson Comer

Please let me mention that the cudf.DataFrame.from_gpu_matrix() method has been deprecated since RAPIDS 0.17.

Nowadays, cudf.DataFrame() accepts Numba DeviceNDArrays directly as input data.

import cudf
import numba as nb

# Convert a Numba DeviceNDArray to a cuDF DataFrame
src = nb.cuda.to_device([[1, 2], [3, 4]])
dst = cudf.DataFrame(src)

print(type(dst), "\n", dst)
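Applied to the wide matrix from the original question, the same constructor avoids the per-column tuple path entirely. A rough sketch, assuming a recent cudf release where renaming via df.columns behaves as in pandas (the 'fea%d' labels just mirror the question's naming):

import cudf
import numba.cuda
import numpy as np

data = np.random.random((300, 100))

# One host-to-device transfer...
gpu = numba.cuda.to_device(data)

# ...then wrap the device matrix directly; no per-column host round trips
df = cudf.DataFrame(gpu)
df.columns = ['fea%d' % i for i in range(data.shape[1])]

print(df.shape)  # (300, 100)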
answered by miguelusque