
PySpark difference between pyspark.sql.functions.col and pyspark.sql.functions.lit

I find it hard to understand the difference between these two functions from pyspark.sql.functions, as the official PySpark documentation is not very informative. For example, take the following code:

import pyspark.sql.functions as F
print(F.col('col_name'))
print(F.lit('col_name'))

The results are:

Column<b'col_name'>
Column<b'col_name'>

So what is the difference between the two, and when should I use one rather than the other?

Jing asked Sep 24 '17 04:09



1 Answer

The doc says:

col:

Returns a Column based on the given column name.

lit:

Creates a Column of literal value


Say we have a DataFrame like the one below:

>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import *

>>> schema = StructType([StructField('A', StringType(), True)])
>>> df = spark.createDataFrame([("a",), ("b",), ("c",)], schema)
>>> df.show()
+---+
|  A|
+---+
|  a|
|  b|
|  c|
+---+

If we use col to create a new column from A:

>>> df.withColumn("new", F.col("A")).show()
+---+---+
|  A|new|
+---+---+
|  a|  a|
|  b|  b|
|  c|  c|
+---+---+

So col refers to an existing column by name; here F.col("A") is equivalent to df.A or df["A"].

If we use F.lit("A") to create the column instead:

>>> df.withColumn("new", F.lit("A")).show()
+---+---+
|  A|new|
+---+---+
|  a|  A|
|  b|  A|
|  c|  A|
+---+---+

lit, by contrast, creates a constant column in which every row holds the given string.

Both return a Column object, but the content and meaning are different.

Psidom answered Oct 14 '22 03:10