Differences between null and NaN in Spark? How to deal with them?

In my DataFrame, there are columns including values of null and NaN respectively, such as:

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

Are there any differences between them? How can they be dealt with?

Ivan Lee asked May 10 '17

People also ask

How do you deal with nulls in Spark?

You can keep null values out of certain columns by setting nullable to false. However, you won't be able to set nullable to false for all columns in a DataFrame and pretend null values don't exist: for example, when joining DataFrames, the join column will contain null where a match cannot be made.

What is the difference between null and NaN?

In JavaScript, null represents the intentional absence of any object value, while undefined indicates that a variable has not been assigned a value or was never declared. NaN ("Not-a-Number") indicates that a value is not a legitimate number.

How do I replace null with zero in Spark?

In Spark, the fill() function of the DataFrameNaFunctions class is used to replace NULL values in a DataFrame column with zero (0), an empty string, a space, or any other constant literal value.

How do you handle null in a data frame?

In order to check for null values in a Pandas DataFrame, we use the isnull() function, which returns a DataFrame of Boolean values that are True for NaN values.
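A minimal pandas sketch of that check (the column names here are made up); note that pandas treats None and NaN the same way in the resulting mask:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, None], "b": [np.nan, 1.0]})

# isnull() (alias: isna()) returns a Boolean DataFrame;
# True marks a missing value, whether it came from None or NaN.
mask = df.isnull()
```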


1 Answer

null represents "no value" or "nothing"; it's not even an empty string or zero. It can be used to represent that nothing useful exists.

NaN stands for "Not a Number"; it's usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0.
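This distinction is visible in plain Python as well: NaN is a floating-point value with the odd property of not being equal to anything, including itself, while None is not a number at all. A quick sketch (using inf - inf, since 0.0/0.0 raises ZeroDivisionError in plain Python):

```python
import math

nan = float('nan')

# NaN is produced by arithmetic that has no defined result.
assert math.isnan(float('inf') - float('inf'))

# NaN compares unequal to everything, even itself...
assert nan != nan

# ...so detecting it needs a dedicated check rather than ==.
assert math.isnan(nan)

# None ("null" in Python terms) is a separate, non-numeric value.
assert None is not nan
```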

One possible way to handle null values is to remove them with:

df.na.drop() 

Or you can change them to an actual value (here I used 0) with:

df.na.fill(0) 

Another way would be to select the rows where a specific column is null (or not null) for further processing:

from pyspark.sql.functions import col

df.where(col("a").isNull())
df.where(col("a").isNotNull())

Rows with NaN values can be selected analogously with the isnan function:

from pyspark.sql.functions import col, isnan

df.where(isnan(col("a")))
Shaido answered Sep 21 '22