What's the difference between Dataset.col() and functions.col() in Spark?

Here's a statement from this answer: https://stackoverflow.com/a/45600938/4164722

Dataset.col returns resolved column while col returns unresolved column.

Can someone provide more details? When should I use Dataset.col() and when functions.col()?

Thanks.

asked Dec 25 '17 by secfree


2 Answers

In the majority of contexts there is no practical difference. For example:

import org.apache.spark.sql.functions.col

val df: Dataset[Row] = ???

df.select(df.col("foo"))
df.select(col("foo"))

are equivalent, and so are:

df.where(df.col("foo") > 0)
df.where(col("foo") > 0)

The difference becomes important when provenance matters, for example in joins:

val df1: Dataset[Row] = ???
val df2: Dataset[Row] = ???

df1.join(df2, Seq("id")).select(df1.col("foo") =!= df2.col("foo"))

Because Dataset.col is resolved and bound to a specific DataFrame, it allows you to unambiguously select a column descending from a particular parent. That wouldn't be possible with the standalone col.
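
To make this concrete, here is a minimal sketch (the column names and values are made up, and it assumes an existing SparkSession named spark): after the join both inputs still carry a "foo" column, so the unresolved col("foo") is ambiguous, while df1.col("foo") unambiguously picks the one that came from df1.

import org.apache.spark.sql.functions.col
import spark.implicits._          // assumes an existing SparkSession named `spark`

val df1 = Seq((1, "a")).toDF("id", "foo")
val df2 = Seq((1, "b")).toDF("id", "foo")

val joined = df1.join(df2, Seq("id"))

// joined.select(col("foo"))      // AnalysisException: Reference 'foo' is ambiguous
joined.select(df1.col("foo"))     // resolved against df1, so no ambiguity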

answered Oct 21 '22 by user9137650

EXPLANATION:

At times you may want to programmatically pre-create column expressions for later use, before the related DataFrame(s) actually exist. In that use case, col(expression) can be useful. Generically illustrated using pySpark syntax:

>>> from pyspark.sql.functions import col, expr
>>> cX = col('col0')  # Define an unresolved column.
>>> cY = col('myCol') # Define another unresolved column.
>>> cX, cY            # Show that these are naked column names.
(Column<b'col0'>, Column<b'myCol'>)

These are called unresolved columns because they are not yet associated with a DataFrame, so Spark cannot know whether those column names actually exist anywhere. You can, however, apply them in a DataFrame context later on, after having prepared them:

>>> from pyspark.sql import Row
>>> df = spark_sesn.createDataFrame([Row(col0=10, col1='Ten', col2=10.0),])
>>> df
DataFrame[col0: bigint, col1: string, col2: double]

>>> df.select(cX).collect()                                                                                
[Row(col0=10)]                      # cX is successfully resolved.

>>> df.select(cY).collect()                                                                                
Traceback (most recent call last):  # Oh dear! cY, which represents
[ ... snip ... ]                    # 'myCol' is truly unresolved here.
                                    # BUT maybe later on it won't be, say,
                                    # after a join() or something else.

CONCLUSION:

col(expression) can help programmatically decouple the DEFINITION of a column specification from its APPLICATION against DataFrame(s) later on. Note that expr(aString), which also returns a column specification, is a generalization of col('xyz'): whole expressions can be DEFINED and later APPLIED:

>>> cZ = expr('col0 + 10')   # Creates a column specification / expression.
>>> cZ
Column<b'(col0 + 10)'>

>>> df.select(cZ).collect() # Applying that expression later on.
[Row((col0 + 10)=20)]
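
The same decoupling works from the Scala API. A minimal sketch, assuming an existing SparkSession named spark and made-up column names, where the expression is DEFINED once and then APPLIED to two different DataFrames:

import org.apache.spark.sql.functions.expr
import spark.implicits._            // assumes an existing SparkSession named `spark`

val plusTen = expr("col0 + 10")     // DEFINED ahead of time, no DataFrame involved

val dfA = Seq(10).toDF("col0")
val dfB = Seq(32).toDF("col0")

dfA.select(plusTen).show()          // APPLIED later: (col0 + 10) = 20
dfB.select(plusTen).show()          // reused against a different DataFrame: 42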

I hope this alternative use-case helps.

answered Oct 21 '22 by NYCeyes