Add an empty column to Spark DataFrame

As mentioned in many other places on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately, this functionality is important (even though it is inefficient in a distributed environment), especially when trying to concatenate two DataFrames using unionAll.

What is the most elegant workaround for adding a null column to a DataFrame to facilitate a unionAll?

My version goes like this:

    from pyspark.sql.types import StringType
    from pyspark.sql.functions import UserDefinedFunction

    # UDF that ignores its input and always returns null
    to_none = UserDefinedFunction(lambda x: None, StringType())
    new_df = old_df.withColumn('new_column', to_none(old_df['any_col_from_old']))
architectonic asked Oct 09 '15 12:10


2 Answers

All you need here is a literal and cast:

    from pyspark.sql.functions import lit
    from pyspark.sql.types import StringType

    new_df = old_df.withColumn('new_column', lit(None).cast(StringType()))

A full example:

    from pyspark.sql import Row
    from pyspark.sql.functions import lit
    from pyspark.sql.types import StringType

    # Assumed definition of `row`; the field names are taken from the schema printed below
    row = Row("foo", "bar")

    df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()
    df.printSchema()

    ## root
    ##  |-- foo: long (nullable = true)
    ##  |-- bar: string (nullable = true)

    new_df = df.withColumn('new_column', lit(None).cast(StringType()))
    new_df.printSchema()

    ## root
    ##  |-- foo: long (nullable = true)
    ##  |-- bar: string (nullable = true)
    ##  |-- new_column: string (nullable = true)

    new_df.show()

    ## +---+---+----------+
    ## |foo|bar|new_column|
    ## +---+---+----------+
    ## |  1|  2|      null|
    ## |  2|  3|      null|
    ## +---+---+----------+

A Scala equivalent can be found here: Create new Dataframe with empty/null field values

zero323 answered Oct 08 '22 19:10


I would cast lit(None) to NullType instead of StringType, so that if we ever have to filter out the non-null rows on that column, it can be done easily as follows:

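    from pyspark.sql import Row
    from pyspark.sql.functions import col, lit
    from pyspark.sql.types import NullType

    df = sc.parallelize([Row(1, "2"), Row(2, "3")]).toDF()

    new_df = df.withColumn('new_column', lit(None).cast(NullType()))

    new_df.printSchema()

    # show() returns None, so keep the filtered DataFrames and display them separately
    df_null = new_df.filter(col("new_column").isNull())
    df_non_null = new_df.filter(col("new_column").isNotNull())

    df_null.show()
    df_non_null.show()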

Also, be careful not to use lit("None") (with quotes) if you are casting to StringType: that produces the literal string "None" rather than a null, so a filter with .isNull() on col("new_column") will not find those records.

Shrikant Prabhu answered Oct 08 '22 17:10