What's the difference between Dataset.col() and functions.col() in Spark?

Here's a statement from this answer: https://stackoverflow.com/a/45600938/4164722

Dataset.col returns resolved column while col returns unresolved column.

Can someone provide more details? When should I use Dataset.col() and when functions.col()?

Thanks.

asked Dec 25 '17 by secfree


2 Answers

In the majority of contexts there is no practical difference. For example:

import org.apache.spark.sql.functions.col

val df: Dataset[Row] = ???

df.select(df.col("foo"))
df.select(col("foo"))

are equivalent, and so are:

df.where(df.col("foo") > 0)
df.where(col("foo") > 0)

The difference becomes important when provenance matters, for example in joins:

val df1: Dataset[Row] = ???
val df2: Dataset[Row] = ???

df1.join(df2, Seq("id")).select(df1.col("foo") =!= df2.col("foo"))

Because Dataset.col is resolved and bound to a specific DataFrame, it allows you to unambiguously select a column descending from a particular parent. That wouldn't be possible with the standalone col.
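
To make this concrete, here is a minimal sketch (the column names and values are made up, and it assumes an existing SparkSession named spark): after the join both inputs still carry a "foo" column, so the unresolved col("foo") is ambiguous, while df1.col("foo") unambiguously picks the one that came from df1.

import org.apache.spark.sql.functions.col
import spark.implicits._          // assumes an existing SparkSession named `spark`

val df1 = Seq((1, "a")).toDF("id", "foo")
val df2 = Seq((1, "b")).toDF("id", "foo")

val joined = df1.join(df2, Seq("id"))

// joined.select(col("foo"))      // AnalysisException: Reference 'foo' is ambiguous
joined.select(df1.col("foo"))     // resolved against df1, so no ambiguity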

answered Oct 21 '22 by user9137650

EXPLANATION:

At times you may want to programmatically pre-create column expressions for later use, before the related DataFrame(s) actually exist. In that use case, col(expression) can be useful. Generically illustrated using pySpark syntax:

>>> from pyspark.sql.functions import col, expr
>>> cX = col('col0')  # Define an unresolved column.
>>> cY = col('myCol') # Define another unresolved column.
>>> cX, cY            # Show that these are naked column names.
(Column<b'col0'>, Column<b'myCol'>)

These are called unresolved columns because they are not yet associated with a DataFrame, so Spark cannot know whether those column names actually exist anywhere. You can, however, apply them in a DataFrame context later on, after having prepared them:

>>> from pyspark.sql import Row
>>> df = spark_sesn.createDataFrame([Row(col0=10, col1='Ten', col2=10.0),])
>>> df
DataFrame[col0: bigint, col1: string, col2: double]

>>> df.select(cX).collect()                                                                                
[Row(col0=10)]                      # cX is successfully resolved.

>>> df.select(cY).collect()                                                                                
Traceback (most recent call last):  # Oh dear! cY, which represents
[ ... snip ... ]                    # 'myCol' is truly unresolved here.
                                    # BUT maybe later on it won't be, say,
                                    # after a join() or something else.

CONCLUSION:

col(expression) can help programmatically decouple the DEFINITION of a column specification from its APPLICATION against DataFrame(s) later on. Note that expr(aString), which also returns a column specification, is a generalization of col('xyz'): whole expressions can be DEFINED and later APPLIED:

>>> cZ = expr('col0 + 10')   # Creates a column specification / expression.
>>> cZ
Column<b'(col0 + 10)'>

>>> df.select(cZ).collect() # Applying that expression later on.
[Row((col0 + 10)=20)]
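
The same decoupling works from the Scala API. A minimal sketch, assuming an existing SparkSession named spark and made-up column names, where the expression is DEFINED once and then APPLIED to two different DataFrames:

import org.apache.spark.sql.functions.expr
import spark.implicits._            // assumes an existing SparkSession named `spark`

val plusTen = expr("col0 + 10")     // DEFINED ahead of time, no DataFrame involved

val dfA = Seq(10).toDF("col0")
val dfB = Seq(32).toDF("col0")

dfA.select(plusTen).show()          // APPLIED later: (col0 + 10) = 20
dfB.select(plusTen).show()          // reused against a different DataFrame: 42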

I hope this alternative use-case helps.

answered Oct 21 '22 by NYCeyes