I find it hard to understand the difference between these two methods from pyspark.sql.functions,
as the documentation on the official PySpark website is not very informative. For example, the following code:
import pyspark.sql.functions as F
print(F.col('col_name'))
print(F.lit('col_name'))
The results are:
Column<b'col_name'>
Column<b'col_name'>
so what is the difference between the two, and when should I use one and not the other?
The doc says:
col:
Returns a Column based on the given column name.
lit:
Creates a Column of literal value.
Say we have a DataFrame as below:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField('A', StringType(), True)])
>>> df = spark.createDataFrame([("a",), ("b",), ("c",)], schema)
>>> df.show()
+---+
| A|
+---+
| a|
| b|
| c|
+---+
If using col to create a new column from A:
>>> df.withColumn("new", F.col("A")).show()
+---+---+
| A|new|
+---+---+
| a| a|
| b| b|
| c| c|
+---+---+
So col grabs an existing column with the given name; F.col("A") is equivalent to df.A or df["A"] here.
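As a quick check (a sketch reusing the same df and spark session from above), all three forms produce the same result:
>>> df.withColumn("new", df.A).show()     # same output as F.col("A") above
>>> df.withColumn("new", df["A"]).show()  # same output as F.col("A") above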
If using F.lit("A") to create the column:
>>> df.withColumn("new", F.lit("A")).show()
+---+---+
| A|new|
+---+---+
| a| A|
| b| A|
| c| A|
+---+---+
lit, on the other hand, creates a constant column with the given string as the value in every row.
Both of them return a Column object, but the content and meaning are different.
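In practice the two are often combined in one expression. A small sketch (again assuming the same df; the column names "flag" and "labelled" are made up for illustration) that adds a constant integer with lit and builds a string from the existing column with col:
>>> df.withColumn("flag", F.lit(1)) \
...   .withColumn("labelled", F.concat(F.lit("row_"), F.col("A"))) \
...   .show()
Here lit(1) fills every row with the constant 1, while concat(lit("row_"), col("A")) prefixes each existing value of A, giving row_a, row_b and row_c.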