How to convert empty arrays to nulls?

I have the dataframe below and I need to convert the empty arrays to null.

|  id|count(AS)|count(asdr)|
|1110| [12, 45]|   [50, 55]|     
|1111|       []|         []|    
|1112| [45, 46]|   [50, 50]|   
|1113|       []|         []|

I have tried the code below, but it is not working.


The expected output should be:

|  id|count(AS)|count(asdr)|
|1110| [12, 45]|   [50, 55]|
|1111|     null|       null|
|1112| [45, 46]|   [50, 50]|
|1113|     null|       null|
Alice asked Jan 03 '18 06:01


3 Answers

For your given dataframe, you can simply do the following

from pyspark.sql import functions as F
df.withColumn("count(AS)", F.when((F.size(F.col("count(AS)")) == 0), F.lit(None)).otherwise(F.col("count(AS)"))) \
    .withColumn("count(asdr)", F.when((F.size(F.col("count(asdr)")) == 0), F.lit(None)).otherwise(F.col("count(asdr)"))).show()

You should have output dataframe as

|  id|count(AS)|count(asdr)|
|1110| [12, 45]|   [50, 55]|
|1111|     null|       null|
|1112| [45, 46]|   [50, 50]|
|1113|     null|       null|


In case you have more than two array columns and you want to apply the above logic dynamically, you can use the following logic

from pyspark.sql import functions as F
for c in df.dtypes:
    if "array" in c[1]:
        df = df.withColumn(c[0], F.when((F.size(F.col(c[0])) == 0), F.lit(None)).otherwise(F.col(c[0])))

df.dtypes would give you array of tuples with column name and datatype. As for the dataframe in the question it would be

[('id', 'bigint'), ('count(AS)', 'array<bigint>'), ('count(asdr)', 'array<bigint>')]

withColumn is applied only to the array columns (those for which "array" in c[1] is true). Inside the when function, the condition F.size(F.col(c[0])) == 0 checks whether the array is empty. If it is, None (null) is populated; otherwise the original value is kept. The loop repeats this for every array column.
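
The column-selection step of that loop can be tried without Spark. Below is a minimal sketch in plain Python that works on the dtypes list shown above; the helper name array_columns is made up for illustration. It uses startswith("array") instead of "array" in ..., on the assumption that you only want top-level array columns: the substring test would also match type strings such as struct<a:array<int>>, where the array is nested inside another type.

```python
# dtypes as listed above for the dataframe in the question
dtypes = [
    ("id", "bigint"),
    ("count(AS)", "array<bigint>"),
    ("count(asdr)", "array<bigint>"),
]


def array_columns(dtypes):
    """Return the names of columns whose top-level type is an array.

    Hypothetical helper, not part of PySpark.
    """
    # startswith is stricter than `"array" in dtype`, which would also
    # match nested types such as "struct<a:array<int>>".
    return [name for name, dtype in dtypes if dtype.startswith("array")]


print(array_columns(dtypes))  # ['count(AS)', 'count(asdr)']
```

The returned names are the ones the withColumn loop would then rewrite; the id column is left alone.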

Ramesh Maharjan answered Oct 01 '22 16:10

I don't think that's possible with na.fill, but this should work for you. The code converts the empty arrays in all ArrayType columns to null and keeps the other columns as they are:

import spark.implicits._
import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.functions._

val df = Seq(
  (110, Seq.empty[Int]),
  (111, Seq(1,2,3))
).toDF("id", "arr")

// get names of array-type columns
val arrColsNames = df.schema.fields.filter(f => f.dataType.isInstanceOf[ArrayType]).map(_.name)

// map all empty arrays to nulls
val emptyArraysAsNulls = arrColsNames.map(n => when(size(col(n))>0,col(n)).as(n))

// non-array-type columns, keep them as they are
val keepCols = df.columns.filterNot(arrColsNames.contains).map(col)

df.select((keepCols ++ emptyArraysAsNulls):_*).show()

| id|      arr|
|110|     null|
|111|[1, 2, 3]|
Raphael Roth answered Oct 01 '22 16:10

There is no easy solution like df.na.fill here. One way would be to loop over all relevant columns and replace values where appropriate. Example using foldLeft in Scala:

val columns = df.schema.filter(_.dataType.typeName == "array").map(_.name)

val df2 = columns.foldLeft(df)((acc, colname) => acc.withColumn(colname, 
    when(size(col(colname)) === 0, null).otherwise(col(colname))))

First, the names of all array-type columns are extracted, and then these are iterated over. Since the size function is only defined for columns of array type, this is a necessary step (and it avoids looping over all columns).

Using the dataframe:

|  id|    col1| col2|
|1110|[12, 11]|   []|
|1111|      []| [11]|
|1112|   [123]|[321]|

The result is as follows:

|  id|    col1| col2|
|1110|[12, 11]| null|
|1111|    null| [11]|
|1112|   [123]|[321]|
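
For readers more used to Python, the fold can be mimicked with functools.reduce. This is only a sketch of the idea on plain dict rows, not on a Spark DataFrame; the rows variable and the empty_to_none helper are made up for illustration, with the data taken from the example dataframe above.

```python
from functools import reduce

# The example dataframe above, modelled as plain dicts (an assumption
# made for illustration; this is not a Spark DataFrame).
rows = [
    {"id": 1110, "col1": [12, 11], "col2": []},
    {"id": 1111, "col1": [], "col2": [11]},
    {"id": 1112, "col1": [123], "col2": [321]},
]
array_cols = ["col1", "col2"]


def empty_to_none(acc, colname):
    """Return new rows where an empty list in `colname` becomes None."""
    return [
        {**row, colname: (None if row[colname] == [] else row[colname])}
        for row in acc
    ]


# Like foldLeft: start from `rows` and apply the step once per column.
result = reduce(empty_to_none, array_cols, rows)
for row in result:
    print(row)
```

As with foldLeft, the accumulator starts as the original data and each step returns a new collection with one more column cleaned, so the input rows are not modified.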
Shaido answered Oct 01 '22 15:10

