How to create a udf in PySpark which returns an array of strings?

I have a UDF that returns a list of strings; this should not be too hard. I pass in the data type when creating the UDF, since it returns an array of strings: ArrayType(StringType).

Now, somehow this is not working:

The DataFrame I'm operating on is df_subsets_concat and looks like this:

df_subsets_concat.show(3,False)
+----------------------+
|col1                  |
+----------------------+
|oculunt               |
|predistposed          |
|incredulous           |
+----------------------+
only showing top 3 rows

and the code is

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType, StringType

my_udf = lambda domain: ['s','n']
label_udf = udf(my_udf, ArrayType(StringType))
df_subsets_concat_with_md = df_subsets_concat.withColumn('subset', label_udf(df_subsets_concat.col1))

and the result is

/usr/lib/spark/python/pyspark/sql/types.py in __init__(self, elementType, containsNull)
    288         False
    289         """
--> 290         assert isinstance(elementType, DataType), "elementType should be DataType"
    291         self.elementType = elementType
    292         self.containsNull = containsNull

AssertionError: elementType should be DataType

It is my understanding that this is the correct way to do it. Here are some resources:

pySpark Data Frames "assert isinstance(dataType, DataType), "dataType should be DataType"
How to return a "Tuple type" in a UDF in PySpark?

But neither of these has helped me resolve why this is not working. I am using PySpark 1.6.1.

How do I create a UDF in PySpark that returns an array of strings?

asked Dec 06 '17 by makansij

People also ask

Can a PySpark UDF return multiple columns?

The short answer is: no. Using a PySpark UDF requires Spark to serialize the Scala objects, run a Python process, deserialize the data in Python, run the function, serialize the results, and deserialize them back in Scala. This causes a considerable performance penalty, so it is best to avoid UDFs in PySpark when a built-in function will do. That said, a common workaround for producing multiple outputs is shown in the sketch below.
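For reference, the usual workaround is to have the UDF return a single StructType column and then expand its fields with select(). A minimal sketch, with made-up column names ('word', 'length') and example logic, assuming a Spark 2.x SparkSession:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('oculunt',), ('incredulous',)], ['col1'])

# The UDF returns a tuple matching this struct schema.
schema = StructType([
    StructField('word', StringType(), True),
    StructField('length', IntegerType(), True),
])
pair_udf = udf(lambda s: (s, len(s)), schema)

# Expand the single struct column into two ordinary columns.
df.withColumn('pair', pair_udf(col('col1'))) \
  .select('col1', 'pair.word', 'pair.length') \
  .show()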

How do I pass a list to a UDF in PySpark?

People say we can use pyspark.sql.functions.array() to directly pass a list of columns to a UDF (from Spark 2.2.0 onwards); a sketch follows below.
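A minimal sketch of that approach, with a made-up DataFrame and join logic (assumes a Spark 2.x SparkSession):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, array, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a', 'b'), ('c', 'd')], ['x', 'y'])

# array() packs the two columns into one array column; the Python
# function then receives it as an ordinary list.
join_udf = udf(lambda parts: '-'.join(parts), StringType())
df.withColumn('joined', join_udf(array(col('x'), col('y')))).show()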

How do you create an array in PySpark?

Create PySpark ArrayType: you can create an instance of an ArrayType using the ArrayType() class. It takes an elementType argument plus one optional argument, containsNull, which specifies whether elements may be null and defaults to True. elementType should be a PySpark type that extends the DataType class, as the sketch below shows.
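A short sketch of that constructor (the 'tags' field name is illustrative):

from pyspark.sql.types import ArrayType, StringType, StructType, StructField

arr = ArrayType(StringType())                 # containsNull defaults to True
arr_no_nulls = ArrayType(StringType(), False) # disallow null elements

# An ArrayType can be used anywhere a DataType is expected, e.g. in a schema.
schema = StructType([StructField('tags', arr, True)])
print(arr.simpleString())  # array<string>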


1 Answer

You need to initialize a StringType instance:

label_udf = udf(my_udf, ArrayType(StringType()))
#                                           ^^ 
df.withColumn('subset', label_udf(df.col1)).show()
+------------+------+
|        col1|subset|
+------------+------+
|     oculunt|[s, n]|
|predistposed|[s, n]|
| incredulous|[s, n]|
+------------+------+
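For completeness, a minimal end-to-end sketch of the fix. The SparkSession setup assumes Spark 2.x; the corrected udf(...) call itself is the same on the asker's 1.6.1:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('oculunt',), ('predistposed',), ('incredulous',)], ['col1'])

# StringType() is an instance; passing the bare StringType class is what
# triggers the "elementType should be DataType" AssertionError above.
label_udf = udf(lambda domain: ['s', 'n'], ArrayType(StringType()))
df.withColumn('subset', label_udf(df.col1)).show()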
answered Sep 20 '22 by Psidom