 

Filter only non-empty arrays in a Spark DataFrame [duplicate]

How can I filter only the non-empty arrays?

    import org.apache.spark.sql.types.ArrayType

    // Collect the names of all array-typed columns in the schema
    val arrayFields = secondDF.schema.filter(st => st.dataType.isInstanceOf[ArrayType])
    val names = arrayFields.map(_.name)

Or with this code:

val DF1 = DF
  .select(col("key"), explode(col("objectiveAttachment")).as("collection"))
  .select(col("collection.*"), col("key"))

 |-- objectiveAttachment: array (nullable = true)
 |    |-- element: string (containsNull = true)

I get this error

 org.apache.spark.sql.AnalysisException: Can only star expand struct data types. Attribute: ArrayBuffer(collection);

Any help is appreciated.

A kram asked Apr 01 '19 18:04

People also ask

How do I filter NOT NULL values in Spark DataFrame?

In Spark, the filter() or where() functions of DataFrame can be used to filter out rows with NULL values by checking IS NULL or isNull. This removes all rows with null values in the state column and returns a new DataFrame.
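For example, a minimal sketch in Scala (the DataFrame df and the state column are assumptions taken from the excerpt above):

    import org.apache.spark.sql.functions.col

    // Keep only rows where "state" is not null; the two forms are equivalent
    val nonNull1 = df.filter(col("state").isNotNull)
    val nonNull2 = df.where("state IS NOT NULL")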

How do you use NOT NULL in PySpark?

Solution: To find the non-null values of PySpark DataFrame columns, use the isNotNull() function, for example df.name.isNotNull(); similarly, for non-NaN values use ~isnan(df.name).
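The excerpt refers to PySpark; an equivalent check in Scala, matching the rest of this page, might look like the sketch below (df and the numeric column amount are hypothetical):

    import org.apache.spark.sql.functions.{col, isnan}

    // Keep rows where "amount" is neither null nor NaN (isnan applies to numeric columns)
    val clean = df.filter(col("amount").isNotNull && !isnan(col("amount")))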

What is the difference between filter and where in a Spark DataFrame?

The Spark where() function filters rows from a DataFrame or Dataset based on one or more conditions or a SQL expression. where() can be used instead of filter() by users coming from a SQL background; both functions operate exactly the same way.
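A quick illustration, assuming a DataFrame df with a key column; all three lines produce the same result:

    import org.apache.spark.sql.functions.col

    val a = df.filter(col("key") === "foo")   // Column expression with filter()
    val b = df.where(col("key") === "foo")    // same expression with where()
    val c = df.where("key = 'foo'")           // SQL-string form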

IS NOT NULL in Spark?

The isNotNull method returns true if the column does not contain a null value, and false otherwise. The isin method returns true if the column's value is contained in a list of arguments, and false otherwise. You will use the isNull, isNotNull, and isin methods constantly when writing Spark code.
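For instance, a small sketch combining isNotNull and isin (the column name and values are assumptions):

    import org.apache.spark.sql.functions.col

    // Keep non-null rows whose "key" is one of the listed values
    val subset = df.filter(col("key").isNotNull && col("key").isin("a", "b", "c"))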


2 Answers

Use the size function:

import org.apache.spark.sql.functions._

secondDF.filter(size($"objectiveAttachment") > 0)
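To see how this fits the question's pipeline, here is a minimal sketch with made-up sample data, assuming a SparkSession named spark is in scope:

    import org.apache.spark.sql.functions.{explode, size}
    import spark.implicits._   // for $ and toDF

    // Made-up rows matching the question's schema
    val secondDF = Seq(
      ("k1", Seq("a", "b")),
      ("k2", Seq.empty[String])
    ).toDF("key", "objectiveAttachment")

    secondDF
      .filter(size($"objectiveAttachment") > 0)   // drop rows with empty arrays
      .select($"key", explode($"objectiveAttachment").as("collection"))
      .show()
    // Only k1's rows remain, one row per array element.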
Henrique Florencio answered Oct 03 '22 04:10


Try the size() function from org.apache.spark.sql.functions._. Note that the filter has to run before the explode/select, while objectiveAttachment is still a column, and that col("collection.*") cannot be star-expanded here because the array elements are plain strings (that is what caused the error in the question):

    import org.apache.spark.sql.functions._

    val df1 = df.filter(size($"objectiveAttachment") > 0)   // keep only non-empty arrays
      .select(col("key"), explode(col("objectiveAttachment")).as("collection"))
deo answered Oct 03 '22 04:10