I find it hard to understand the difference between these two methods from pyspark.sql.functions,
as the documentation on the official PySpark website is not very informative. For example, the following code:
import pyspark.sql.functions as F
print(F.col('col_name'))
print(F.lit('col_name'))
The results are:
Column<b'col_name'>
Column<b'col_name'>
so what is the difference between the two, and when should I use one and not the other?
The doc says:
col:
Returns a Column based on the given column name.
lit:
Creates a Column of literal value.
Say we have a DataFrame as below:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField('A', StringType(), True)])
>>> df = spark.createDataFrame([("a",), ("b",), ("c",)], schema)
>>> df.show()
+---+
| A|
+---+
| a|
| b|
| c|
+---+
If using col to create a new column from A:
>>> df.withColumn("new", F.col("A")).show()
+---+---+
| A|new|
+---+---+
| a| a|
| b| b|
| c| c|
+---+---+
So col grabs an existing column with the given name; F.col("A") is equivalent to df.A or df["A"] here.
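As a quick check (a sketch reusing the same df and spark session from above), all three forms produce the same result:
>>> df.withColumn("new", df.A).show()     # same output as F.col("A") above
>>> df.withColumn("new", df["A"]).show()  # same output as F.col("A") above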
If using F.lit("A") to create the column:
>>> df.withColumn("new", F.lit("A")).show()
+---+---+
| A|new|
+---+---+
| a| A|
| b| A|
| c| A|
+---+---+
lit, on the other hand, creates a constant column with the given string as the value in every row.
Both of them return a Column object, but the content and meaning are different.
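In practice the two are often combined in one expression. A small sketch (again assuming the same df; the column names "flag" and "labelled" are made up for illustration) that adds a constant integer with lit and builds a string from the existing column with col:
>>> df.withColumn("flag", F.lit(1)) \
...   .withColumn("labelled", F.concat(F.lit("row_"), F.col("A"))) \
...   .show()
Here lit(1) fills every row with the constant 1, while concat(lit("row_"), col("A")) prefixes each existing value of A, giving row_a, row_b and row_c.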