
How to convert empty arrays to nulls?

I have the dataframe below and I need to convert the empty arrays to null.

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|       []|         []|
|1112| [45, 46]|   [50, 50]|
|1113|       []|         []|
+----+---------+-----------+

I have tried the code below, which is not working.

df.na.fill("null").show()

The expected output should be:

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|     null|       null|
|1112| [45, 46]|   [50, 50]|
|1113|     null|       null|
+----+---------+-----------+
Asked Jan 03 '18 by Alice




3 Answers

For your given dataframe, you can simply do the following:

from pyspark.sql import functions as F

# replace an empty array with null, otherwise keep the original value
df.withColumn("count(AS)", F.when((F.size(F.col("count(AS)")) == 0), F.lit(None)).otherwise(F.col("count(AS)"))) \
    .withColumn("count(asdr)", F.when((F.size(F.col("count(asdr)")) == 0), F.lit(None)).otherwise(F.col("count(asdr)"))).show()

You should get the output dataframe as:

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|     null|       null|
|1112| [45, 46]|   [50, 50]|
|1113|     null|       null|
+----+---------+-----------+

Updated

In case you have more than two array columns and want to apply the above logic to all of them dynamically, you can use the following:

from pyspark.sql import functions as F

# loop over (column name, datatype) pairs and rewrite only the array columns
for c in df.dtypes:
    if "array" in c[1]:
        df = df.withColumn(c[0], F.when((F.size(F.col(c[0])) == 0), F.lit(None)).otherwise(F.col(c[0])))
df.show()

Here, df.dtypes gives you a list of tuples of column name and datatype. For the dataframe in the question it would be

[('id', 'bigint'), ('count(AS)', 'array<bigint>'), ('count(asdr)', 'array<bigint>')]

withColumn is applied only to the array columns ("array" in c[1]), where F.size(F.col(c[0])) == 0 is the condition for the when function, which checks the size of the array. If the condition is true, i.e. the array is empty, None is populated; otherwise the original value is kept. The loop applies this to all the array columns.
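
If you have many array columns, the same logic can also be expressed as a single select instead of repeated withColumn calls. A minimal sketch of that variant (same condition as above, just built as a list of column expressions):

from pyspark.sql import functions as F

# build one expression per column: null out empty arrays, keep everything else as is
exprs = [
    F.when(F.size(F.col(name)) == 0, F.lit(None)).otherwise(F.col(name)).alias(name)
    if "array" in dtype
    else F.col(name)
    for name, dtype in df.dtypes
]
df.select(*exprs).show()

Both versions produce the same output; select just avoids building one intermediate dataframe per column.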

Answered by Ramesh Maharjan


I don't think that's possible with na.fill: it replaces null values, and these arrays are empty rather than null (besides, fill only supports numeric, string, and boolean types). But this should work for you. The code converts all empty ArrayType columns to null and keeps the other columns as they are:

import spark.implicits._
import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.functions._

val df = Seq(
  (110, Seq.empty[Int]),
  (111, Seq(1,2,3))
).toDF("id","arr")

// get names of array-type columns
val arrColsNames = df.schema.fields.filter(f => f.dataType.isInstanceOf[ArrayType]).map(_.name)

// map all empty arrays to nulls; when without an otherwise defaults to null
val emptyArraysAsNulls = arrColsNames.map(n => when(size(col(n))>0,col(n)).as(n))

// non-array-type columns, keep them as they are
val keepCols = df.columns.filterNot(arrColsNames.contains).map(col)

df
  .select((keepCols ++ emptyArraysAsNulls):_*)
  .show()

+---+---------+
| id|      arr|
+---+---------+
|110|     null|
|111|[1, 2, 3]|
+---+---------+
Answered by Raphael Roth


There is no easy solution like df.na.fill here. One way is to loop over all relevant columns and replace values where appropriate. Example using foldLeft in Scala:

val columns = df.schema.filter(_.dataType.typeName == "array").map(_.name)

val df2 = columns.foldLeft(df)((acc, colname) => acc.withColumn(colname, 
    when(size(col(colname)) === 0, null).otherwise(col(colname))))

First, all columns of array type are extracted and then iterated over. Since the size function is only defined for columns of array type, this is a necessary step (and it avoids looping over all columns).

Using the dataframe:

+----+--------+-----+
|  id|    col1| col2|
+----+--------+-----+
|1110|[12, 11]|   []|
|1111|      []| [11]|
|1112|   [123]|[321]|
+----+--------+-----+

The result is as follows:

+----+--------+-----+
|  id|    col1| col2|
+----+--------+-----+
|1110|[12, 11]| null|
|1111|    null| [11]|
|1112|   [123]|[321]|
+----+--------+-----+
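
If you are working in PySpark rather than Scala, the same fold idiom can be written with functools.reduce playing the role of foldLeft. A minimal sketch, assuming a dataframe df with some array-type columns:

from functools import reduce
from pyspark.sql import functions as F

# names of all array-type columns, taken from the (name, dtype) pairs in df.dtypes
array_cols = [name for name, dtype in df.dtypes if dtype.startswith("array")]

# fold over the array columns, replacing empty arrays with null in each one
df2 = reduce(
    lambda acc, c: acc.withColumn(
        c, F.when(F.size(F.col(c)) == 0, F.lit(None)).otherwise(F.col(c))
    ),
    array_cols,
    df,
)
df2.show()
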
Answered by Shaido