How to add a constant column in a Spark DataFrame?

People also ask

How do I add a constant column in Spark DataFrame?

Add New Column with Constant Value In PySpark, to add a new column to DataFrame use lit() function by importing from pyspark. sql. functions import lit , lit() function takes a constant value you wanted to add and returns a Column type, if you wanted to add a NULL / None use lit(None) .

How do I add a column in Spark dataset?

A new column could be added to an existing Dataset using Dataset. withColumn() method. withColumn accepts two arguments: the column name to be added, and the Column and returns a new Dataset<Row>. The syntax of withColumn() is provided below.

How do you add a literal constant to a DataFrame?

PySpark lit() – Add Literal or Constant to DataFramePySpark SQL functions lit() and typedLit() are used to add a new column to DataFrame by assigning a literal or constant value. Both these functions return Column type as return type.

How do I add column names to a DataFrame in Spark?

1. Using Spark withColumnRenamed – To rename DataFrame column name. Spark has a withColumnRenamed() function on DataFrame to change a column name. This is the most straight forward approach; this function takes two parameters; the first is your existing column name and the second is the new column name you wish for.

Spark 2.2+

Spark 2.2 introduces typedLit to support Seq, Map, and Tuples (SPARK-19254) and following calls should be supported (Scala):

import org.apache.spark.sql.functions.typedLit

df.withColumn("some_array", typedLit(Seq(1, 2, 3)))
df.withColumn("some_struct", typedLit(("foo", 1, 0.3)))
df.withColumn("some_map", typedLit(Map("key1" -> 1, "key2" -> 2)))

Spark 1.3+ (lit), 1.4+ (array, struct), 2.0+ (map):

The second argument for DataFrame.withColumn should be a Column so you have to use a literal:

from pyspark.sql.functions import lit

df.withColumn('new_column', lit(10))

If you need complex columns you can build these using blocks like array:

from pyspark.sql.functions import array, create_map, struct

df.withColumn("some_array", array(lit(1), lit(2), lit(3)))
df.withColumn("some_struct", struct(lit("foo"), lit(1), lit(.3)))
df.withColumn("some_map", create_map(lit("key1"), lit(1), lit("key2"), lit(2)))

Exactly the same methods can be used in Scala.

import org.apache.spark.sql.functions.{array, lit, map, struct}

df.withColumn("new_column", lit(10))
df.withColumn("map", map(lit("key1"), lit(1), lit("key2"), lit(2)))

To provide names for structs use either alias on each field:

df.withColumn(
    "some_struct",
    struct(lit("foo").alias("x"), lit(1).alias("y"), lit(0.3).alias("z"))
 )

or cast on the whole object

df.withColumn(
    "some_struct", 
    struct(lit("foo"), lit(1), lit(0.3)).cast("struct<x: string, y: integer, z: double>")
 )

It is also possible, although slower, to use an UDF.

Note:

The same constructs can be used to pass constant arguments to UDFs or SQL functions.

In spark 2.2 there are two ways to add constant value in a column in DataFrame:

1) Using lit

2) Using typedLit.

The difference between the two is that typedLit can also handle parameterized scala types e.g. List, Seq, and Map

Sample DataFrame:

val df = spark.createDataFrame(Seq((0,"a"),(1,"b"),(2,"c"))).toDF("id", "col1")

+---+----+
| id|col1|
+---+----+
|  0|   a|
|  1|   b|
+---+----+

1) Using lit: Adding constant string value in new column named newcol:

import org.apache.spark.sql.functions.lit
val newdf = df.withColumn("newcol",lit("myval"))

Result:

+---+----+------+
| id|col1|newcol|
+---+----+------+
|  0|   a| myval|
|  1|   b| myval|
+---+----+------+

2) Using typedLit:

import org.apache.spark.sql.functions.typedLit
df.withColumn("newcol", typedLit(("sample", 10, .044)))

Result:

+---+----+-----------------+
| id|col1|           newcol|
+---+----+-----------------+
|  0|   a|[sample,10,0.044]|
|  1|   b|[sample,10,0.044]|
|  2|   c|[sample,10,0.044]|
+---+----+-----------------+

As the other answers have described, lit and typedLit are how to add constant columns to DataFrames. lit is an important Spark function that you will use frequently, but not for adding constant columns to DataFrames.

You'll commonly be using lit to create org.apache.spark.sql.Column objects because that's the column type required by most of the org.apache.spark.sql.functions.

Suppose you have a DataFrame with a some_date DateType column and would like to add a column with the days between December 31, 2020 and some_date.

Here's your DataFrame:

+----------+
| some_date|
+----------+
|2020-09-23|
|2020-01-05|
|2020-04-12|
+----------+

Here's how to calculate the days till the year end:

val diff = datediff(lit(Date.valueOf("2020-12-31")), col("some_date"))
df
  .withColumn("days_till_yearend", diff)
  .show()

+----------+-----------------+
| some_date|days_till_yearend|
+----------+-----------------+
|2020-09-23|               99|
|2020-01-05|              361|
|2020-04-12|              263|
+----------+-----------------+

You could also use lit to create a year_end column and compute the days_till_yearend like so:

import java.sql.Date

df
  .withColumn("yearend", lit(Date.valueOf("2020-12-31")))
  .withColumn("days_till_yearend", datediff(col("yearend"), col("some_date")))
  .show()

+----------+----------+-----------------+
| some_date|   yearend|days_till_yearend|
+----------+----------+-----------------+
|2020-09-23|2020-12-31|               99|
|2020-01-05|2020-12-31|              361|
|2020-04-12|2020-12-31|              263|
+----------+----------+-----------------+

Most of the time, you don't need to use lit to append a constant column to a DataFrame. You just need to use lit to convert a Scala type to a org.apache.spark.sql.Column object because that's what's required by the function.

See the datediff function signature:

enter image description here

As you can see, datediff requires two Column arguments.

Related questions
                            
                                Pipenv: Command Not Found
                            
                                Cosine Similarity between 2 Number Lists
                            
                                UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function [duplicate]
                            
                                Correct way to define Python source code encoding
                            
                                Python Git Module experiences? [closed]
                            
                                How to JSON serialize sets?
                            
                                Pickle incompatibility of numpy arrays between Python 2 and 3
                            
                                How do I know if I can disable SQLALCHEMY_TRACK_MODIFICATIONS?
                            
                                How to get the size of a string in Python?
                            
                                Writing to an Excel spreadsheet
                            
                                AttributeError: 'module' object has no attribute 'urlopen'
                            
                                How to install 2 Anacondas (Python 2 and 3) on Mac OS
                            
                                What is the most efficient string concatenation method in python?
                            
                                How could I use requests in asyncio?
                            
                                range() for floats
                            
                                Clear text from textarea with selenium
                            
                                Convert decimal to binary in python [duplicate]
                            
                                'Conda' is not recognized as internal or external command
                            
                                How to display a float with two decimal places?
                            
                                setuptools vs. distutils: why is distutils still a thing?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to add a constant column in a Spark DataFrame?

Tags:

python

dataframe

apache-spark

apache-spark-sql

pyspark

People also ask

Recent Activity

Donate For Us