 

PySpark: Get first Non-null value of each column in dataframe

I'm dealing with different Spark DataFrames, which have lots of null values in many columns. I want to get any one non-null value from each column to see whether that value can be converted to a datetime.

I tried doing df.na.drop().first(), in the hope that it would drop all rows with any null value and that, from the remaining DataFrame, I'd just get the first row with all non-null values. But many of the DataFrames have so many columns with lots of null values that df.na.drop() returns an empty DataFrame.

I also tried finding out whether any columns contain only null values, so that I could simply drop those columns before trying the above approach, but that still didn't solve the problem. Any idea how I can accomplish this efficiently, given that this code will be run many times on huge DataFrames?
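For illustration, a rough sketch of both attempts (not the exact code; df stands for one of the DataFrames):

from pyspark.sql.functions import col

# attempt 1: drop rows containing any null, then take the first surviving row
# (first() returns None when df.na.drop() comes back empty)
candidate = df.na.drop().first()

# attempt 2: drop the all-null columns first -- one pass per column, hence slow
all_null = [c for c in df.columns if not df.where(col(c).isNotNull()).head(1)]
candidate = df.drop(*all_null).na.drop().first()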

asked May 09 '17 by anwartheravian


People also ask

How do you get non-null values in PySpark?

Solution: To find the non-null values of a PySpark DataFrame column, use the isNotNull() function, for example df.name.isNotNull(); similarly, for non-NaN values, negate isnan(), i.e. ~isnan(df.name).
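A minimal sketch, assuming a hypothetical DataFrame df with a string column name and a numeric column value (isnan() only works on numeric columns):

from pyspark.sql.functions import isnan

non_null = df.filter(df.name.isNotNull())   # rows where name is not null
non_nan = df.filter(~isnan(df.value))       # rows where value is not NaN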

How do you get the first value of a column in PySpark?

To do this, use the first() or head() function. Syntax: dataframe.first()['column name']
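A minimal sketch, assuming a DataFrame df with a column called name:

first_row = df.first()          # Row object for the first row
value = df.first()['name']      # value of the name column in that row
value = df.head()['name']       # head() with no arguments returns the same Row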

How do you find the NULL values in each column in PySpark?

In a PySpark DataFrame you can calculate the count of null, None, NaN, or empty/blank values in a column by using isNull() from the Column class together with the SQL functions isnan(), count(), and when().
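A minimal sketch of the per-column null count (the value column is an illustrative numeric column, since isnan() requires one):

from pyspark.sql.functions import col, count, isnan, when

# nulls per column, computed in a single pass
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# nulls or NaNs for a numeric column
df.select(count(when(col('value').isNull() | isnan('value'), 'value'))).show()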

What does .collect do in PySpark?

collect() is an operation on an RDD or DataFrame that retrieves its data: it gathers all the elements of every partition and brings them to the driver node/program.
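A minimal sketch (the name column is illustrative; collect() pulls every row to the driver, so use it with care on large DataFrames):

rows = df.collect()        # list of Row objects on the driver
for row in rows:
    print(row['name'])     # access a field by column name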


1 Answer

You can use the first function with ignorenulls=True. Let's say the data looks like this:

from pyspark.sql.types import StringType, StructType, StructField

schema = StructType([
    StructField("x{}".format(i), StringType(), True) for i in range(3)
])

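# each sample row contains exactly one null, in a different column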
df = spark.createDataFrame(
    [(None, "foo", "bar"), ("foo", None, "bar"), ("foo", "bar", None)],
    schema
)

You can:

from pyspark.sql.functions import first

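# first non-null value of every column, returned as a single Row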
df.select([first(x, ignorenulls=True).alias(x) for x in df.columns]).first()
Row(x0='foo', x1='foo', x2='bar')
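
Since the original goal was to check whether each value can be parsed as a datetime, a hypothetical follow-up (the use of dateutil here is an assumption, not part of the answer) could be:

from dateutil.parser import parse  # assumption: dateutil is available on the driver

sample = df.select([first(x, ignorenulls=True).alias(x) for x in df.columns]).first()

for name, value in sample.asDict().items():
    try:
        parse(value)                    # raises on values that aren't datetime-like
        print('{}: parseable as datetime'.format(name))
    except (ValueError, TypeError):     # TypeError covers all-null columns (value is None)
        print('{}: not a datetime'.format(name))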
answered Nov 15 '22 by zero323