
Is it possible to store a numpy array in a Spark Dataframe Column?

I have a dataframe and I apply a function to it. This function returns a NumPy array. The code looks like this:

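from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

# create_vector returns a numpy.ndarray for each value of the 'text' column
create_vector_udf = udf(create_vector, ArrayType(FloatType()))
dataframe = dataframe.withColumn('vector', create_vector_udf('text'))
dataframe.select('lang','url','vector').show(20)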

Spark does not seem to be happy with this and does not accept ArrayType(FloatType()). I get the following error message: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

I could just call numpyarray.tolist() and return a list version of it, but then I would have to recreate the array every time I want to use it with NumPy.
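
To make that round trip concrete, here is a minimal sketch of it, assuming a one-dimensional float vector. The helper names to_column_value and from_column_value are hypothetical and only illustrate the idea; they are not part of Spark or NumPy.

```python
import numpy as np


def to_column_value(arr):
    """Convert a NumPy array to a plain list of Python floats."""
    # .tolist() converts the elements as well, not just the container
    return np.asarray(arr, dtype=np.float32).tolist()


def from_column_value(values):
    """Rebuild a NumPy array from a list read back out of a row."""
    return np.asarray(values, dtype=np.float32)


original = np.array([0.5, 1.25, 2.0], dtype=np.float32)
stored = to_column_value(original)    # what would go into the column
restored = from_column_value(stored)  # what NumPy code would work with
```

Rebuilding the array costs one np.asarray call per value read back, so the overhead is mostly a matter of convenience.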

So is there a way to store a NumPy array in a dataframe column?

Thagor asked Jul 07 '17 08:07





1 Answer

The source of the problem is that the object returned from the UDF doesn't conform to the declared type. create_vector is not only returning a numpy.ndarray, it is also converting the numeric values to the corresponding NumPy types, which are not compatible with the DataFrame API.

The only option is to use something like this:

udf(lambda x: create_vector(x).tolist(), ArrayType(FloatType()))
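
A slightly fuller sketch of the same idea follows. The original create_vector is not shown in the question, so the version here is a hypothetical stand-in that simply returns a small float32 array; add_vector_column is likewise an illustrative name. The Spark imports sit inside the function so the plain helpers can be tried without a Spark installation.

```python
import numpy as np


def create_vector(text):
    # Hypothetical stand-in for the asker's function: returns a NumPy array
    return np.array([len(text), text.count(" ")], dtype=np.float32)


def create_vector_as_list(text):
    # .tolist() turns the ndarray into a list AND its elements into Python floats
    return create_vector(text).tolist()


def add_vector_column(dataframe):
    # Assumes pyspark is installed and `dataframe` has a string column 'text'
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, FloatType

    create_vector_udf = udf(create_vector_as_list, ArrayType(FloatType()))
    return dataframe.withColumn("vector", create_vector_udf("text"))


example = create_vector_as_list("hello big world")
```

Note that list(create_vector(x)) would most likely not be enough here, because the elements of the resulting list would still be NumPy scalars rather than Python floats. After collecting rows, the array can be rebuilt with something like np.array(row.vector).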
pissall answered Sep 21 '22 12:09
