 

Create an empty array column of certain type in pyspark DataFrame

I'm trying to add a column containing an empty array of arrays of strings to a DataFrame, but I end up with a column of arrays of strings instead.

I tried this:

import pyspark.sql.functions as F
df = df.withColumn('newCol', F.array([]))

How can I do this in pyspark?

asked Aug 28 '19 by David Taub


People also ask

How do you create an empty column in PySpark?

In PySpark, you can add a new column with a constant value using the lit() function from pyspark.sql.functions. lit() takes the constant value you want to add and returns a Column; to add a NULL / None value, use lit(None).

How do you create an array in PySpark?

Create PySpark ArrayType You can create an instance of an ArrayType using the ArrayType() class. It takes an elementType and one optional argument, containsNull, which specifies whether elements may be null (True by default). elementType should be a PySpark type that extends the DataType class.

How do I select a specific column in PySpark?

You can select one or more DataFrame columns by passing the column names you want to the select() function. Since DataFrames are immutable, this creates a new DataFrame containing only the selected columns; the show() function displays its contents.


2 Answers

Another way to achieve an empty array of arrays column:

import pyspark.sql.functions as F
df = df.withColumn('newCol', F.array(F.array()))

Because F.array() defaults to an array of strings type, the newCol column will have type ArrayType(ArrayType(StringType,false),false). If you need the inner array to be some type other than string, you can cast the inner F.array() directly as follows.

import pyspark.sql.functions as F
import pyspark.sql.types as T
int_array_type = T.ArrayType(T.IntegerType())  # "array<integer>" also works
df = df.withColumn('newCol', F.array(F.array().cast(int_array_type)))

In this example, newCol will have a type of ArrayType(ArrayType(IntegerType,true),false).

answered Sep 30 '22 by David Zhao


This is one way to do it:

>>> import pyspark.sql.functions as F
>>> myList = [('Alice', 1)]
>>> df = spark.createDataFrame(myList)
>>> df.schema
StructType(List(StructField(_1,StringType,true),StructField(_2,LongType,true)))
>>> df = df.withColumn('temp', F.array()).withColumn("newCol", F.array("temp")).drop("temp")
>>> df.schema
StructType(List(StructField(_1,StringType,true),StructField(_2,LongType,true),StructField(newCol,ArrayType(ArrayType(StringType,false),false),false)))
>>> df
DataFrame[_1: string, _2: bigint, newCol: array<array<string>>]
>>> df.collect()
[Row(_1=u'Alice', _2=1, newCol=[[]])]

answered Sep 30 '22 by moriarty007