I have the following sample DataFrame:
a | b | c | 1 | 2 | 4 | 0 | null | null| null | 3 | 4 |
And I want to replace null values only in the first 2 columns - Column "a" and "b":
a | b | c | 1 | 2 | 4 | 0 | 0 | null| 0 | 3 | 4 |
Here is the code to create sample dataframe:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)]) df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all null values using:
df2 = df2.fillna(0)
And when I try this, I lose the third column:
df2 = df2.select(df2.columns[0:1]).fillna(0)
You can select the single or multiple columns of the Spark DataFrame by passing the column names you wanted to select to the select() function. Since DataFrame is immutable, this creates a new DataFrame with a selected columns. show() function is used to show the DataFrame contents.
PySpark FillNa is a PySpark function that is used to replace Null values that are present in the PySpark data frame model in a single or multiple columns in PySpark. This value can be anything depending on the business requirements. It can be 0, empty string, or any constant literal.
You can replace column values of PySpark DataFrame by using SQL string functions regexp_replace(), translate(), and overlay() with Python examples.
In pyspark the drop() function can be used to remove values/columns from the dataframe.
df.fillna(0, subset=['a', 'b'])
There is a parameter named subset
to choose the columns unless your spark version is lower than 1.3.1
Use a dictionary to fill values of certain columns:
df.fillna( { 'a':0, 'b':0 } )
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With