
PySpark: How to fillna values in dataframe for specific columns?

I have the following sample DataFrame:

a    | b    | c
1    | 2    | 4
0    | null | null
null | 3    | 4

I want to replace null values only in the first two columns, "a" and "b":

a    | b    | c
1    | 2    | 4
0    | 0    | null
0    | 3    | 4

Here is the code to create the sample DataFrame:

rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])

I know how to replace all null values using:

df2 = df2.fillna(0) 

And when I try this, I lose the third column:

df2 = df2.select(df2.columns[0:1]).fillna(0) 
Rakesh Adhikesavan asked Jul 12 '17 19:07




2 Answers

df.fillna(0, subset=['a', 'b']) 

The subset parameter restricts the fill to the columns you list. It is available as long as your Spark version is not lower than 1.3.1.

Zhang Tong answered Oct 07 '22 07:10



Use a dictionary to fill values of certain columns:

df.fillna( { 'a':0, 'b':0 } ) 
scottlittle answered Oct 07 '22 08:10
