Differences between null and NaN in Spark? How to deal with them?

In my DataFrame, there are columns including values of null and NaN respectively, such as:

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

Are there any differences between them? How can they be dealt with?

Ivan Lee asked May 10 '17

People also ask

How do you deal with nulls in Spark?

You can keep null values out of certain columns by setting nullable to false. However, you won't be able to set nullable to false for all columns in a DataFrame and pretend null values don't exist: for example, when joining DataFrames, the join column will contain null where a match cannot be made.

What is the difference between null and NaN?

In JavaScript, null represents the intentional absence of any object value, while undefined indicates that a variable has not been assigned a value or was never declared. NaN ("Not-a-Number") indicates that a value is not a legitimate number.

How do I replace null with zero in Spark?

In Spark, the fill() function of the DataFrameNaFunctions class is used to replace NULL values in a DataFrame column with zero (0), an empty string, a space, or any other constant literal value.

How do you handle null in a data frame?

In order to check for null values in a Pandas DataFrame, we use the isnull() function, which returns a DataFrame of Boolean values that are True for NaN values.
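A minimal pandas sketch of that check (the column names here are made up); note that pandas treats None and NaN the same way in the resulting mask:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, None], "b": [np.nan, 1.0]})

# isnull() (alias: isna()) returns a Boolean DataFrame;
# True marks a missing value, whether it came from None or NaN.
mask = df.isnull()
```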


1 Answer

null represents "no value" or "nothing"; it's not even an empty string or zero. It can be used to represent that nothing useful exists.

NaN stands for "Not a Number"; it's usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0.
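This distinction is visible in plain Python as well: NaN is a floating-point value with the odd property of not being equal to anything, including itself, while None is not a number at all. A quick sketch (using inf - inf, since 0.0/0.0 raises ZeroDivisionError in plain Python):

```python
import math

nan = float('nan')

# NaN is produced by arithmetic that has no defined result.
assert math.isnan(float('inf') - float('inf'))

# NaN compares unequal to everything, even itself...
assert nan != nan

# ...so detecting it needs a dedicated check rather than ==.
assert math.isnan(nan)

# None ("null" in Python terms) is a separate, non-numeric value.
assert None is not nan
```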

One possible way to handle null values is to remove them with:

df.na.drop() 

Or you can change them to an actual value (here I used 0) with:

df.na.fill(0) 

Another way would be to select the rows where a specific column is null (or not null) for further processing:

from pyspark.sql.functions import col

df.where(col("a").isNull())
df.where(col("a").isNotNull())

Rows with NaN values can be selected analogously with the isnan function:

from pyspark.sql.functions import col, isnan

df.where(isnan(col("a")))
Shaido answered Sep 21 '22