Here's a statement from this answer: https://stackoverflow.com/a/45600938/4164722
Dataset.col returns a resolved column, while col returns an unresolved column.
Can someone provide more details? When should I use Dataset.col() and when functions.col()?
Thanks.
Spark's withColumn() is a DataFrame function used to add a new column to a DataFrame, change the value of an existing column, convert a column's datatype, or derive a new column from an existing one. Commonly used DataFrame column operations are illustrated with the Scala example below.
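A minimal sketch of those operations, assuming a hypothetical toy DataFrame (the names, salaries, and SparkSession setup are made up purely for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical toy data, just to exercise the operations listed above.
val df = Seq(("Alice", 3000), ("Bob", 4000)).toDF("name", "salary")

df.withColumn("country", lit("US"))                    // add a new column
  .withColumn("salary", col("salary").cast("double")) // convert a column's datatype
  .withColumn("salary", col("salary") * 1.1)          // change an existing column's value
  .withColumn("bonus", col("salary") * 0.1)           // derive a new column from an existing one
  .show()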
A Dataset is a distributed collection of data. The Dataset interface, added in Spark 1.6, combines the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine.
The pyspark.sql.Column class provides several functions for working with DataFrame columns: manipulating column values, evaluating boolean expressions to filter rows, retrieving a value or part of a value from a DataFrame column, and working with list, map, and struct columns.
Dataset is a data structure in Spark SQL that is strongly typed and mapped to a relational schema. It represents structured queries with encoders and is an extension of the DataFrame API. A Spark Dataset provides both type safety and an object-oriented programming interface.
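As a brief, hedged illustration of that typed interface (the Person case class and the values are hypothetical):

import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Strongly typed: each element is a Person, not just a generic Row.
val people: Dataset[Person] = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()

// Lambdas work directly on the typed objects, while Spark SQL still
// optimizes the underlying query plan.
people.filter(_.age > 30).show()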
In the majority of contexts there is no practical difference. For example:
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.col

val df: Dataset[Row] = ???
df.select(df.col("foo"))
df.select(col("foo"))
are equivalent, same as:
df.where(df.col("foo") > 0)
df.where(col("foo") > 0)
The difference becomes important when provenance matters, for example in joins:
val df1: Dataset[Row] = ???
val df2: Dataset[Row] = ???
df1.join(df2, Seq("id")).select(df1.col("foo") =!= df2.col("foo"))
Because Dataset.col is resolved and bound to a DataFrame, it allows you to unambiguously select a column descending from a particular parent. That wouldn't be possible with col.
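To make that concrete, here is a minimal sketch of the provenance problem; the SparkSession setup and the toy data are hypothetical, added only so the snippet is self-contained:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Both parents carry a column named "foo".
val df1 = Seq((1, "a"), (2, "b")).toDF("id", "foo")
val df2 = Seq((1, "x"), (2, "y")).toDF("id", "foo")

val joined = df1.join(df2, Seq("id"))

// The unresolved col("foo") is ambiguous here: Spark cannot tell which
// parent it should come from, and analysis fails with an AnalysisException.
// joined.select(col("foo"))

// Dataset.col is bound to its parent, so the lineage is explicit:
joined.select(df1.col("foo"), df2.col("foo")).show()

The same lineage tracking is what makes df1.col("foo") =!= df2.col("foo") in the join example above unambiguous.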
EXPLANATION:
At times you may want to programmatically pre-create (i.e. ahead of time) column expressions for later use -- before the related DataFrame(s) actually exist. In that use-case, col(expression) can be useful. Generically illustrated using pySpark syntax:
>>> from pyspark.sql.functions import col
>>> cX = col('col0')  # Define an unresolved column.
>>> cY = col('myCol') # Define another unresolved column.
>>> cX, cY            # Show that these are naked column names.
(Column<b'col0'>, Column<b'myCol'>)
These are called unresolved columns because they are not associated with a DataFrame, so there is no way yet to know whether those column names actually exist anywhere. However, you may in fact apply them in a DataFrame context later on, after having prepared them:
>>> from pyspark.sql import Row
>>> df = spark_sesn.createDataFrame([Row(col0=10, col1='Ten', col2=10.0),])
>>> df
DataFrame[col0: bigint, col1: string, col2: double]
>>> df.select(cX).collect()
[Row(col0=10)] # cX is successfully resolved.
>>> df.select(cY).collect()
Traceback (most recent call last): # Oh dear! cY, which represents
[ ... snip ... ] # 'myCol' is truly unresolved here.
# BUT maybe later on it won't be, say,
# after a join() or something else.
CONCLUSION:
col(expression) can help programmatically decouple the DEFINITION of a column specification from the APPLICATION of it against DataFrame(s) later on. Note that expr(aString), which also returns a column specification, provides a generalization of col('xyz'), where whole expressions can be DEFINED and later APPLIED:
>>> from pyspark.sql.functions import expr
>>> cZ = expr('col0 + 10') # Creates a column specification / expression.
>>> cZ
Column<b'(col0 + 10)'>
>>> df.select(cZ).collect() # Applying that expression later on.
[Row((col0 + 10)=20)]
I hope this alternative use-case helps.