I'm trying to add a column containing an empty array of arrays of strings to a DataFrame, but I end up with a column of arrays of strings instead.
I tried this:
import pyspark.sql.functions as F
df = df.withColumn('newCol', F.array([]))
How can I do this in pyspark?
Another way to create an empty array-of-arrays column:
import pyspark.sql.functions as F
df = df.withColumn('newCol', F.array(F.array()))
Because F.array() defaults to an array of strings (this is the Spark 2.x behavior; in Spark 3.0+ an empty array() has element type void unless spark.sql.legacy.createEmptyCollectionUsingStringType is set), the newCol column will have type ArrayType(ArrayType(StringType,false),false). If you need the inner array to be some type other than string, you can cast the inner F.array() directly, as follows.
import pyspark.sql.functions as F
import pyspark.sql.types as T
int_array_type = T.ArrayType(T.IntegerType()) # "array<integer>" also works
df = df.withColumn('newCol', F.array(F.array().cast(int_array_type)))
In this example, newCol will have a type of ArrayType(ArrayType(IntegerType,true),false).
This is another way:
>>> import pyspark.sql.functions as F
>>> myList = [('Alice', 1)]
>>> df = spark.createDataFrame(myList)
>>> df.schema
StructType(List(StructField(_1,StringType,true),StructField(_2,LongType,true)))
>>> df = df.withColumn('temp', F.array()).withColumn("newCol", F.array("temp")).drop("temp")
>>> df.schema
StructType(List(StructField(_1,StringType,true),StructField(_2,LongType,true),StructField(newCol,ArrayType(ArrayType(StringType,false),false),false)))
>>> df
DataFrame[_1: string, _2: bigint, newCol: array<array<string>>]
>>> df.collect()
[Row(_1=u'Alice', _2=1, newCol=[[]])]