Is it possible to store a numpy array in a Spark Dataframe Column?

Tags: numpy, pyspark, spark-dataframe

I have a DataFrame and I apply a function to it. This function returns a NumPy array. The code looks like this:

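from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

# create_vector is my own function; it returns a numpy.ndarray
create_vector_udf = udf(create_vector, ArrayType(FloatType()))
dmoz_spark_df = dmoz_spark_df.withColumn('vector', create_vector_udf('text'))
dmoz_spark_df.select('lang','url','vector').show(20)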

Spark does not seem to be happy with this and does not accept ArrayType(FloatType()). I get the following error message:

net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

I could just call numpyarray.tolist() and return a list version of it, but then I would always have to recreate the array whenever I want to use it with NumPy.

So is there a way to store a NumPy array in a DataFrame column?


asked Jul 07 '17 08:07

Thagor

1 Answer

The source of the problem is that the object returned from the UDF doesn't conform to the declared type. create_vector is evidently not only returning a numpy.ndarray but also converting the numeric values to the corresponding NumPy types, which are not compatible with the DataFrame API.
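To make the type mismatch concrete, here is a minimal sketch that needs only NumPy. The body of create_vector is a hypothetical stand-in (the real function is not shown in the question); the point is only to compare the element types before and after tolist().

```python
import numpy as np

# Hypothetical stand-in for the asker's create_vector; the real one is not shown.
def create_vector(text):
    return np.array([len(text), text.count(" ")], dtype=np.float32)

vec = create_vector("hello world")
as_list = vec.tolist()

# The container and its elements are NumPy types, which the UDF result
# serializer does not know how to turn into ArrayType(FloatType()).
print(type(vec).__name__)         # ndarray
print(type(vec[0]).__name__)      # float32

# tolist() converts both the container and the elements to built-in types.
print(type(as_list).__name__)     # list
print(type(as_list[0]).__name__)  # float
```

Note that tolist() converts the elements as well as the container, whereas list(vec) would still leave you with numpy.float32 elements.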

The straightforward fix is to convert the result to a plain Python list inside the UDF, something like this:


udf(lambda x: create_vector(x).tolist(), ArrayType(FloatType()))
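As for having to recreate the array afterwards: a rough sketch of the full round trip might look like the following. The names create_vector_as_list, to_numpy and add_vector_column are made up for illustration, create_vector is again a hypothetical stand-in, and the float32 dtype is an assumption chosen to match FloatType. The pyspark imports sit inside the function only so that the two small helpers can be tried without Spark installed.

```python
import numpy as np

# Hypothetical stand-in for the asker's create_vector; the real one is not shown.
def create_vector(text):
    return np.array([len(text), text.count(" ")], dtype=np.float32)

def create_vector_as_list(text):
    """UDF body: return plain Python floats, matching ArrayType(FloatType())."""
    return create_vector(text).tolist()

def to_numpy(values):
    """Rebuild an ndarray from the list stored in the DataFrame column."""
    return np.asarray(values, dtype=np.float32)

def add_vector_column(df, input_col="text", output_col="vector"):
    """Add an array<float> column computed by create_vector (illustrative helper)."""
    # Imported lazily so the helpers above work without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, FloatType

    vector_udf = udf(create_vector_as_list, ArrayType(FloatType()))
    return df.withColumn(output_col, vector_udf(input_col))
```

On the reading side, an array column comes back as a Python list in each collected row, so something like to_numpy(row.vector) gives you an ndarray again. The conversion is one call per row, which is usually a small cost next to the UDF itself.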


answered Sep 21 '22 12:09

pissall
