As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately it is important to have this functionality (even though it is inefficient in a distributed environment), especially when trying to concatenate two DataFrames using unionAll.

What is the most elegant workaround for adding a null column to a DataFrame to facilitate a unionAll?
My version goes like this:
```python
from pyspark.sql.types import StringType
from pyspark.sql.functions import UserDefinedFunction

# A UDF that ignores its input and always returns None, typed as a string column.
to_none = UserDefinedFunction(lambda x: None, StringType())
new_df = old_df.withColumn('new_column', to_none(old_df['any_col_from_old']))
```
In PySpark, to add a new column to a DataFrame, use the lit() function, imported with `from pyspark.sql.functions import lit`. lit() takes a constant value and returns a Column; to add a NULL / None column, use lit(None).
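For illustration, a minimal sketch of adding a constant (non-null) literal column; the DataFrame and column names here are assumed, not from the question:

```python
from pyspark.sql.functions import lit

# lit() wraps a Python constant in a Column expression.
df_with_source = old_df.withColumn('source', lit('legacy'))  # constant string column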
A new column can also be added to an existing Dataset using the Dataset.withColumn() method. withColumn() accepts two arguments, the name of the column to be added and a Column expression, and returns a new Dataset<Row>; the original is left unchanged. Its shape is sketched below.
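A short PySpark illustration of that two-argument shape, assuming an existing DataFrame df with a numeric column foo:

```python
from pyspark.sql.functions import col

# withColumn(name, column_expression) returns a new DataFrame.
df2 = df.withColumn('foo_doubled', col('foo') * 2)
```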
To replace an empty value with None/null in a single DataFrame column, you can combine withColumn() with when().
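A sketch of that replacement, assuming a string column named 'name':

```python
from pyspark.sql.functions import col, lit, when

# Rows where 'name' is the empty string become null; all others are kept as-is.
df_cleaned = df.withColumn('name', when(col('name') == '', lit(None)).otherwise(col('name')))
```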
Adding one or more new columns to a Spark DataFrame can be done with the withColumn() and select() methods of DataFrame, whether the new column is derived from an existing column, a constant or literal value, or a list.
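Since withColumn() appears throughout the rest of this page, here is a sketch of the select() route instead; old_df and the column name are assumed for illustration:

```python
from pyspark.sql.functions import lit

# Keep every existing column and append a typed null column in one select().
new_df = old_df.select('*', lit(None).cast('string').alias('new_column'))
```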
All you need here is a literal and cast:
```python
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

new_df = old_df.withColumn('new_column', lit(None).cast(StringType()))
```
A full example:
```python
from pyspark.sql import Row
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

row = Row("foo", "bar")  # Row factory with named fields
df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()
df.printSchema()
## root
## |-- foo: long (nullable = true)
## |-- bar: string (nullable = true)

new_df = df.withColumn('new_column', lit(None).cast(StringType()))
new_df.printSchema()
## root
## |-- foo: long (nullable = true)
## |-- bar: string (nullable = true)
## |-- new_column: string (nullable = true)

new_df.show()
## +---+---+----------+
## |foo|bar|new_column|
## +---+---+----------+
## |  1|  2|      null|
## |  2|  3|      null|
## +---+---+----------+
```
A Scala equivalent can be found here: Create new Dataframe with empty/null field values
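Tying back to the motivation in the question, a sketch of the unionAll use case; other_df is an assumed DataFrame that already has new_column:

```python
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

# Pad old_df so its schema matches other_df, then union the two.
padded = old_df.withColumn('new_column', lit(None).cast(StringType()))
# unionAll matches columns by position, so align the column order first.
combined = padded.select(other_df.columns).unionAll(other_df)
```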
I would cast lit(None) to NullType instead of StringType, so that if we ever have to filter out non-null rows on that column, it can easily be done as follows:
```python
from pyspark.sql import Row
from pyspark.sql.functions import col, lit
from pyspark.sql.types import NullType

row = Row("foo", "bar")
df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()
new_df = df.withColumn('new_column', lit(None).cast(NullType()))
new_df.printSchema()

# Both filters are straightforward on a NullType column.
df_null = new_df.filter(col("new_column").isNull()).show()
df_non_null = new_df.filter(col("new_column").isNotNull()).show()
```
Also be careful not to use lit("None") (with quotes) if you are casting to StringType, since searching for records with the filter condition .isNull() on col("new_column") would fail.
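A quick demonstration of that pitfall, reusing the df from above:

```python
from pyspark.sql.functions import col, lit

# lit("None") stores the literal string "None", which is not SQL NULL,
# so isNull() matches zero rows.
df.withColumn('new_column', lit("None")) \
  .filter(col("new_column").isNull()) \
  .count()  # returns 0
```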