I have the below dataframe and I need to convert the empty arrays to null.
+----+---------+-----------+
| id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]| [50, 55]|
|1111| []| []|
|1112| [45, 46]| [50, 50]|
|1113| []| []|
+----+---------+-----------+
I have tried the below code, which is not working:
df.na.fill("null").show()
The expected output should be:
+----+---------+-----------+
| id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]| [50, 55]|
|1111|     null|       null|
|1112| [45, 46]| [50, 50]|
|1113|     null|       null|
+----+---------+-----------+
For your given dataframe, you can simply do the following:
from pyspark.sql import functions as F

# replace empty arrays with null in both array columns
df.withColumn("count(AS)", F.when((F.size(F.col("count(AS)")) == 0), F.lit(None)).otherwise(F.col("count(AS)"))) \
    .withColumn("count(asdr)", F.when((F.size(F.col("count(asdr)")) == 0), F.lit(None)).otherwise(F.col("count(asdr)"))).show()
You should get the output dataframe as:
+----+---------+-----------+
| id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]| [50, 55]|
|1111| null| null|
|1112| [45, 46]| [50, 50]|
|1113| null| null|
+----+---------+-----------+
Updated
In case you have more than two array columns and you want to apply the above logic dynamically, you can do the following:
from pyspark.sql import functions as F

# apply the same empty-array-to-null logic to every array-type column
for c in df.dtypes:
    if "array" in c[1]:
        df = df.withColumn(c[0], F.when((F.size(F.col(c[0])) == 0), F.lit(None)).otherwise(F.col(c[0])))
df.show()
Here, df.dtypes gives you a list of tuples with column name and datatype. For the dataframe in the question it would be:
[('id', 'bigint'), ('count(AS)', 'array<bigint>'), ('count(asdr)', 'array<bigint>')]
withColumn is applied only to the array-type columns ("array" in c[1]). F.size(F.col(c[0])) == 0 is the condition for the when function, which checks the size of the array: if the condition is true, i.e. the array is empty, None is populated; otherwise the original value is kept. The loop applies this to all the array columns.
I don't think that's possible with na.fill, but this should work for you. The code converts all empty ArrayType columns to null and keeps the other columns as they are:
import spark.implicits._
import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.functions._
val df = Seq(
  (110, Seq.empty[Int]),
  (111, Seq(1, 2, 3))
).toDF("id", "arr")
// get names of array-type columns
val arrColsNames = df.schema.fields.filter(f => f.dataType.isInstanceOf[ArrayType]).map(_.name)
// map all empty arrays to nulls
val emptyArraysAsNulls = arrColsNames.map(n => when(size(col(n))>0,col(n)).as(n))
// non-array-type columns, keep them as they are
val keepCols = df.columns.filterNot(arrColsNames.contains).map(col)
df
  .select((keepCols ++ emptyArraysAsNulls): _*)
  .show()
+---+---------+
| id| arr|
+---+---------+
|110| null|
|111|[1, 2, 3]|
+---+---------+
There is no easy solution like df.na.fill here. One way would be to loop over all relevant columns and replace values where appropriate. Example using foldLeft in Scala:
import org.apache.spark.sql.functions._

// collect the names of all array-type columns, then fold over them
val columns = df.schema.filter(_.dataType.typeName == "array").map(_.name)
val df2 = columns.foldLeft(df)((acc, colname) =>
  acc.withColumn(colname, when(size(col(colname)) === 0, null).otherwise(col(colname))))
First, all columns of array type are extracted and then iterated over. Since the size function is only defined for columns of array type, this is a necessary step (and avoids looping over all columns).
Using the dataframe:
+----+--------+-----+
| id| col1| col2|
+----+--------+-----+
|1110|[12, 11]| []|
|1111| []| [11]|
|1112| [123]|[321]|
+----+--------+-----+
The result is as follows:
+----+--------+-----+
| id| col1| col2|
+----+--------+-----+
|1110|[12, 11]| null|
|1111| null| [11]|
|1112| [123]|[321]|
+----+--------+-----+