How to replace all Null values of a dataframe in Pyspark

People also ask

How do I change the null value in Spark DataFrame?

Replacing null values is one of the most common operations on PySpark DataFrames. It can be done with either DataFrame.fillna() or DataFrameNaFunctions.fill().

How do you replace all NULL values?

In SQL Server (T-SQL), ISNULL is a built-in function that replaces nulls with a specified value. Pass the column name as the first argument and the replacement value as the second.


You can use df.na.fill to replace nulls with zeros, for example:

>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
|   1|
|   2|
|   3|
|null|
+----+

>>> df.na.fill(0).show()
+---+
|col|
+---+
|  1|
|  2|
|  3|
|  0|
+---+

You can also use the fillna() function:

>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
|   1|
|   2|
|   3|
|null|
+----+

>>> df = df.fillna({'col': 4})
>>> df.show()
+---+
|col|
+---+
|  1|
|  2|
|  3|
|  4|
+---+

Or, as a single expression without reassigning df: df.fillna({'col': 4}).show()

There are three ways to call fillna.

Documentation:

def fillna(self, value, subset=None):
   """Replace null values, alias for ``na.fill()``.
   :func:`DataFrame.fillna` and :func:`DataFrameNaFunctions.fill` are aliases of each other.

   :param value: int, long, float, string, bool or dict.
       Value to replace null values with.
       If the value is a dict, then `subset` is ignored and `value` must be a mapping
       from column name (string) to replacement value. The replacement value must be
       an int, long, float, boolean, or string.
   :param subset: optional list of column names to consider.
       Columns specified in subset that do not have matching data type are ignored.
       For example, if `value` is a string, and subset contains a non-string column,
       then the non-string column is simply ignored.

So you can:

  1. fill all columns with the same value: df.fillna(value)
  2. pass a dictionary of column --> value: df.fillna(dict_of_col_to_value)
  3. pass a list of columns to fill with the same value: df.fillna(value, subset=list_of_cols)

fillna() is an alias for na.fill(), so the two behave identically.