I have the below dataframe and I need to convert the empty arrays to null.
+----+---------+-----------+
| id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]| [50, 55]|
|1111| []| []|
|1112| [45, 46]| [50, 50]|
|1113| []| []|
+----+---------+-----------+
I have tried the below code, which is not working:
df.na.fill("null").show()
The expected output should be:
+----+---------+-----------+
| id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]| [50, 55]|
|1111|     null|       null|
|1112| [45, 46]| [50, 50]|
|1113|     null|       null|
+----+---------+-----------+
For your given dataframe, you can simply do the following:
from pyspark.sql import functions as F

# replace empty arrays with null in both array columns
df.withColumn("count(AS)", F.when((F.size(F.col("count(AS)")) == 0), F.lit(None)).otherwise(F.col("count(AS)"))) \
    .withColumn("count(asdr)", F.when((F.size(F.col("count(asdr)")) == 0), F.lit(None)).otherwise(F.col("count(asdr)"))).show()
You should get the output dataframe as:
+----+---------+-----------+
| id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]| [50, 55]|
|1111| null| null|
|1112| [45, 46]| [50, 50]|
|1113| null| null|
+----+---------+-----------+
Updated
In case you have more than two array columns and you want to apply the above logic dynamically, you can do the following:
from pyspark.sql import functions as F

# apply the same empty-array-to-null logic to every array-type column
for c in df.dtypes:
    if "array" in c[1]:
        df = df.withColumn(c[0], F.when((F.size(F.col(c[0])) == 0), F.lit(None)).otherwise(F.col(c[0])))
df.show()
Here, df.dtypes gives you a list of tuples with column name and datatype. For the dataframe in the question it would be:
[('id', 'bigint'), ('count(AS)', 'array<bigint>'), ('count(asdr)', 'array<bigint>')]
withColumn is applied only to the array-type columns ("array" in c[1]). F.size(F.col(c[0])) == 0 is the condition for the when function, which checks the size of the array: if the condition is true, i.e. the array is empty, None is populated; otherwise the original value is kept. The loop applies this to all the array columns.
I don't think that's possible with na.fill, but this should work for you. The code converts all empty ArrayType columns to null and keeps the other columns as they are:
import spark.implicits._
import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.functions._
val df = Seq(
  (110, Seq.empty[Int]),
  (111, Seq(1, 2, 3))
).toDF("id", "arr")
// get names of array-type columns
val arrColsNames = df.schema.fields.filter(f => f.dataType.isInstanceOf[ArrayType]).map(_.name)
// map all empty arrays to nulls
val emptyArraysAsNulls = arrColsNames.map(n => when(size(col(n))>0,col(n)).as(n))
// non-array-type columns, keep them as they are
val keepCols = df.columns.filterNot(arrColsNames.contains).map(col)
df
  .select((keepCols ++ emptyArraysAsNulls): _*)
  .show()
+---+---------+
| id| arr|
+---+---------+
|110| null|
|111|[1, 2, 3]|
+---+---------+
There is no easy solution like df.na.fill here. One way would be to loop over all relevant columns and replace values where appropriate. Example using foldLeft in Scala:
import org.apache.spark.sql.functions._

// collect the names of all array-type columns, then fold over them
val columns = df.schema.filter(_.dataType.typeName == "array").map(_.name)
val df2 = columns.foldLeft(df)((acc, colname) =>
  acc.withColumn(colname, when(size(col(colname)) === 0, null).otherwise(col(colname))))
First, all columns of array type are extracted and then iterated over. Since the size function is only defined for columns of array type, this is a necessary step (and avoids looping over all columns).
Using the dataframe:
+----+--------+-----+
| id| col1| col2|
+----+--------+-----+
|1110|[12, 11]| []|
|1111| []| [11]|
|1112| [123]|[321]|
+----+--------+-----+
The result is as follows:
+----+--------+-----+
| id| col1| col2|
+----+--------+-----+
|1110|[12, 11]| null|
|1111| null| [11]|
|1112| [123]|[321]|
+----+--------+-----+