 

PySpark: Get first Non-null value of each column in dataframe

I'm dealing with different Spark DataFrames, which have lots of null values in many columns. I want to get any one non-null value from each column to see whether that value can be converted to a datetime.

I tried doing df.na.drop().first(), in the hope that it would drop all rows with any null value and that, from the remaining DataFrame, I'd just get the first row with all non-null values. But many of the DataFrames have so many columns with lots of null values that df.na.drop() returns an empty DataFrame.

I also tried finding out whether any columns contain only null values, so that I could simply drop those columns before trying the above approach, but that still didn't solve the problem. Any idea how I can accomplish this efficiently, given that this code will be run many times on huge DataFrames?
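For illustration, a rough sketch of both attempts (not the exact code; df stands for one of the DataFrames):

from pyspark.sql.functions import col

# attempt 1: drop rows containing any null, then take the first surviving row
# (first() returns None when df.na.drop() comes back empty)
candidate = df.na.drop().first()

# attempt 2: drop the all-null columns first -- one pass per column, hence slow
all_null = [c for c in df.columns if not df.where(col(c).isNotNull()).head(1)]
candidate = df.drop(*all_null).na.drop().first()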

asked May 09 '17 by anwartheravian


People also ask

How do you get non-null values in PySpark?

Solution: To find the non-null values of a PySpark DataFrame column, use the isNotNull() function, for example df.name.isNotNull(); similarly, for non-NaN values, negate isnan(), i.e. ~isnan(df.name).
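A minimal sketch, assuming a hypothetical DataFrame df with a string column name and a numeric column value (isnan() only works on numeric columns):

from pyspark.sql.functions import isnan

non_null = df.filter(df.name.isNotNull())   # rows where name is not null
non_nan = df.filter(~isnan(df.value))       # rows where value is not NaN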

How do you get the first value of a column in PySpark?

To do this, use the first() or head() function. Syntax: dataframe.first()['column name']
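A minimal sketch, assuming a DataFrame df with a column called name:

first_row = df.first()          # Row object for the first row
value = df.first()['name']      # value of the name column in that row
value = df.head()['name']       # head() with no arguments returns the same Row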

How do you find the NULL values in each column in PySpark?

In a PySpark DataFrame you can calculate the count of null, None, NaN, or empty/blank values in a column by using isNull() from the Column class together with the SQL functions isnan(), count(), and when().
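A minimal sketch of the per-column null count (the value column is an illustrative numeric column, since isnan() requires one):

from pyspark.sql.functions import col, count, isnan, when

# nulls per column, computed in a single pass
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# nulls or NaNs for a numeric column
df.select(count(when(col('value').isNull() | isnan('value'), 'value'))).show()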

What does .collect do in PySpark?

collect() is an operation on an RDD or DataFrame that retrieves its data: it gathers all the elements of every partition and brings them to the driver node/program.
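A minimal sketch (the name column is illustrative; collect() pulls every row to the driver, so use it with care on large DataFrames):

rows = df.collect()        # list of Row objects on the driver
for row in rows:
    print(row['name'])     # access a field by column name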


1 Answer

You can use the first function with ignorenulls=True. Let's say the data looks like this:

from pyspark.sql.types import StringType, StructType, StructField

schema = StructType([
    StructField("x{}".format(i), StringType(), True) for i in range(3)
])

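# each sample row contains exactly one null, in a different column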
df = spark.createDataFrame(
    [(None, "foo", "bar"), ("foo", None, "bar"), ("foo", "bar", None)],
    schema
)

You can:

from pyspark.sql.functions import first

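# first non-null value of every column, returned as a single Row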
df.select([first(x, ignorenulls=True).alias(x) for x in df.columns]).first()
Row(x0='foo', x1='foo', x2='bar')
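
Since the original goal was to check whether each value can be parsed as a datetime, a hypothetical follow-up (the use of dateutil here is an assumption, not part of the answer) could be:

from dateutil.parser import parse  # assumption: dateutil is available on the driver

sample = df.select([first(x, ignorenulls=True).alias(x) for x in df.columns]).first()

for name, value in sample.asDict().items():
    try:
        parse(value)                    # raises on values that aren't datetime-like
        print('{}: parseable as datetime'.format(name))
    except (ValueError, TypeError):     # TypeError covers all-null columns (value is None)
        print('{}: not a datetime'.format(name))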
answered Nov 15 '22 by zero323