
Recommended cudf Dataframe Construction

I'm interested in recommended, fast ways of creating cudf DataFrames from dense numpy arrays. I have seen many examples that split the columns of a 2-D numpy matrix into tuples and then call cudf.DataFrame on the list of tuples -- this is rather expensive. Using numba.cuda.to_device is quite fast. Is it possible to build on numba.cuda.to_device, or is there a more efficient way of constructing the DataFrame?

In [1]: import cudf

In [2]: import numba.cuda

In [3]: import numpy as np

In [4]: data = np.random.random((300,100))

In [5]: data.nbytes
Out[5]: 240000

In [6]: %time numba.cuda.to_device(data)
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 4.45 ms
Out[6]: <numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f8954f84550>

In [7]: record_data = (('fea%d'%i, data[:,i]) for i in range(data.shape[1]))

In [8]: %time cudf.DataFrame(record_data)
CPU times: user 960 ms, sys: 508 ms, total: 1.47 s
Wall time: 1.61 s
Out[8]: <cudf.DataFrame ncols=100 nrows=300 >

The above shows cudf.DataFrame being ~360x slower than a direct call to numba.cuda.to_device.

asked by quasiben

2 Answers

cudf.DataFrame is a dedicated columnar format and performs best with data that is very tall rather than wide. However, there are some important zero-copy functions that let you move data between numba, cupy, and cudf inexpensively. At this point in time, as far as I know, the best way to get a raw numpy matrix into cudf is to use numba.cuda.to_device, as you identified, followed by cudf.DataFrame.from_gpu_matrix.

import cudf
import numba.cuda
import numpy as np

data = np.random.random((300, 100))

# Single host-to-device copy, then wrap the device matrix as a DataFrame
%time gpu = numba.cuda.to_device(data)
%time df = cudf.DataFrame.from_gpu_matrix(gpu, columns=['fea%d' % i for i in range(data.shape[1])])

Out:

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 872 µs
CPU times: user 180 ms, sys: 0 ns, total: 180 ms
Wall time: 186 ms

The 186 ms to create the cudf.DataFrame is the minimum creation time; it is overhead primarily from host-side management of columnar memory and metadata.
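The zero-copy interop mentioned above works through __cuda_array_interface__. Below is a minimal sketch, assuming CuPy is installed, of the kind of hand-offs between Numba, CuPy, and cudf that avoid host round trips; exact accessors (such as Series.values) can vary across RAPIDS releases.

import cudf
import cupy as cp
import numba.cuda
import numpy as np

# One host-to-device copy up front
gpu = numba.cuda.to_device(np.random.random(300))

# Numba device array -> CuPy view (zero-copy via __cuda_array_interface__)
cp_arr = cp.asarray(gpu)

# CuPy array -> cudf Series (stays on the device; no round trip through the host)
s = cudf.Series(cp_arr)

# cudf Series -> CuPy array (Series.values returns a CuPy representation)
back = s.values

# CuPy array -> Numba device array view, e.g. for custom kernels
nb_view = numba.cuda.as_cuda_array(back)

Where possible these conversions reinterpret the same device buffer; some constructors may still copy device-to-device, which is far cheaper than going back through host memory.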

answered by Thomson Comer

Please let me mention that the cudf.DataFrame.from_gpu_matrix() method has been deprecated since RAPIDS 0.17.

Nowadays, cudf.DataFrame() accepts Numba DeviceNDArrays directly as input data.

import cudf
import numba as nb

# Convert a Numba DeviceNDArray to a cuDF DataFrame
src = nb.cuda.to_device([[1, 2], [3, 4]])
dst = cudf.DataFrame(src)

print(type(dst), "\n", dst)
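Applied to the wide matrix from the original question, the same constructor avoids the per-column tuple path entirely. A rough sketch, assuming a recent cudf release where renaming via df.columns behaves as in pandas (the 'fea%d' labels just mirror the question's naming):

import cudf
import numba.cuda
import numpy as np

data = np.random.random((300, 100))

# One host-to-device transfer...
gpu = numba.cuda.to_device(data)

# ...then wrap the device matrix directly; no per-column host round trips
df = cudf.DataFrame(gpu)
df.columns = ['fea%d' % i for i in range(data.shape[1])]

print(df.shape)  # (300, 100)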
answered by miguelusque