Filter Pyspark dataframe column with None value

People also ask

How do I find null values in a column in PySpark?

In a PySpark DataFrame, you can count the Null, None, NaN, and empty/blank values in a column by using isNull() from the Column class together with the SQL functions isnan(), count(), and when().

How do you remove null values from a column in PySpark?

To remove rows with NULL values in selected columns of a PySpark DataFrame, use df.na.drop(subset=[...]) or df.dropna(subset=[...]). Pass the names of the columns you want to check for NULL values; rows with a NULL in any of them are dropped.

How do you replace null values in a column in PySpark?

In PySpark, DataFrame.fillna() or DataFrameNaFunctions.fill() is used to replace NULL/None values in all or selected DataFrame columns with zero (0), an empty string, a space, or any other constant literal value.

You can use Column.isNull / Column.isNotNull:

from pyspark.sql.functions import col

df.where(col("dt_mvmt").isNull())

df.where(col("dt_mvmt").isNotNull())

If you simply want to drop rows with NULL values, you can use na.drop with the subset argument:

df.na.drop(subset=["dt_mvmt"])

Equality-based comparisons with NULL won't work because in SQL NULL represents an unknown value, so comparing it with any other value (including another NULL) returns NULL:

sqlContext.sql("SELECT NULL = NULL").show()
## +-------------+
## |(NULL = NULL)|
## +-------------+
## |         null|
## +-------------+


sqlContext.sql("SELECT NULL != NULL").show()
## +-------------------+
## |(NOT (NULL = NULL))|
## +-------------------+
## |               null|
## +-------------------+

The standard way to test a value for NULL is IS NULL / IS NOT NULL, which are equivalent to the isNull / isNotNull method calls. (Spark SQL also has a null-safe equality operator, <=>.)

Try just using the isNotNull function:

df.filter(df.dt_mvmt.isNotNull()).count()

To get the rows whose dt_mvmt value is not null, use:

df.filter("dt_mvmt is not NULL")

and for the rows whose value is null, use:

df.filter("dt_mvmt is NULL")

There are several ways to remove or filter null values from a DataFrame column.

Let's create a simple DataFrame with the code below:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

date = ['2016-03-27','2016-03-28','2016-03-29', None, '2016-03-30','2016-03-31']
df = spark.createDataFrame(date, StringType())

Now you can try any of the approaches below to filter out the null values.

# Approach - 1
df.filter("value is not null").show()

# Approach - 2
df.filter(col("value").isNotNull()).show()

# Approach - 3
df.filter(df["value"].isNotNull()).show()

# Approach - 4
df.filter(df.value.isNotNull()).show()

# Approach - 5
df.na.drop(subset=["value"]).show()

# Approach - 6
df.dropna(subset=["value"]).show()

# Note: where() is an alias for filter(), so you can use either.

You can also check the section "Working with NULL Values" on my blog for more information.

I hope it helps.

isNull() / isNotNull() return the rows in which dt_mvmt is null or not null, respectively.

method_1 = df.filter(df['dt_mvmt'].isNotNull()).count()
method_2 = df.filter(df.dt_mvmt.isNotNull()).count()

Both return the same result.

If the column holds the literal string 'None' rather than a real null:

COLUMN_OLD_VALUE
----------------
None
1
None
100
20
------------------

Create a temp table from the DataFrame:

sqlContext.sql("select * from tempTable where column_old_value='None' ").show()

So use column_old_value='None'. Note that this matches only the string 'None'; rows that hold a real NULL need IS NULL instead.

If you want to stick with pandas-style syntax, this worked for me:

df = df[df.dt_mvmt.isNotNull()]

Tags:

python

dataframe

apache-spark

apache-spark-sql

pyspark
