PySpark DataFrame Column Reference: df.col vs. df['col'] vs. F.col('col')?

1 Answers

In most practical applictions, there is almost no difference. However, they are implemented by calls to different underlying functions (source) and thus are not exactly the same.

We can illustrate with a small example:

df = spark.createDataFrame(
    [(1,'a', 0), (2,'b',None), (None,'c',3)], 
    ['col', '2col', 'third col']
)

df.show()
#+----+----+---------+
#| col|2col|third col|
#+----+----+---------+
#|   1|   a|        0|
#|   2|   b|     null|
#|null|   c|        3|
#+----+----+---------+

1. `df.col`

This is the least flexible. You can only reference columns that are valid to be accessed using the . operator. This rules out column names containing spaces or special characters and column names that start with an integer.

This syntax makes a call to df.__getattr__("col").

print(df.__getattr__.__doc__)
#Returns the :class:`Column` denoted by ``name``.
#
#        >>> df.select(df.age).collect()
#        [Row(age=2), Row(age=5)]
#
#        .. versionadded:: 1.3

Using the . syntax, you can only access the first column of this example dataframe.

>>> df.2col
  File "<ipython-input-39-8e82c2dd5b7c>", line 1
    df.2col
       ^
SyntaxError: invalid syntax

Under the hood, it checks to see if the column name is contained in df.columns and then returns the pyspark.sql.Column specified.

2. `df["col"]`

This makes a call to df.__getitem__. You have some more flexibility in that you can do everything that __getattr__ can do, plus you can specify any column name.

df["2col"]
#Column<2col>

Once again, under the hood some conditionals are checked and in this case the pyspark.sql.Column specified by the input string is returned.

In addition, you can as pass in multiple columns (as a list or tuple) or column expressions.

from pyspark.sql.functions import expr
df[['col', expr('`third col` IS NULL')]].show()
#+----+-------------------+
#| col|(third col IS NULL)|
#+----+-------------------+
#|   1|              false|
#|   2|               true|
#|null|              false|
#+----+-------------------+

Note that in the case of multiple columns, __getitem__ is just making a call to pyspark.sql.DataFrame.select.

Finally, you can also access columns by index:

df[2]
#Column<third col>

3. `pyspark.sql.functions.col`

This is the Spark native way of selecting a column and returns a expression (this is the case for all column functions) which selects the column on based on the given name. This is useful shorthand when you need to specify that you want a column and not a string literal.

For example, supposed we wanted to make a new column that would take on the either the value from "col" or "third col" based on the value of "2col":

from pyspark.sql.functions import when

df.withColumn(
    'new', 
    f.when(df['2col'].isin(['a', 'c']), 'third col').otherwise('col')
).show()
#+----+----+---------+---------+
#| col|2col|third col|      new|
#+----+----+---------+---------+
#|   1|   a|        0|third col|
#|   2|   b|     null|      col|
#|null|   c|        3|third col|
#+----+----+---------+---------+

Oops, that's not what I meant. Spark thought I wanted the literal strings "col" and "third col". Instead, what I should have written is:

from pyspark.sql.functions import col
df.withColumn(
    'new', 
    when(df['2col'].isin(['a', 'c']), col('third col')).otherwise(col('col'))
).show()
#+----+----+---------+---+
#| col|2col|third col|new|
#+----+----+---------+---+
#|   1|   a|        0|  0|
#|   2|   b|     null|  2|
#|null|   c|        3|  3|
#+----+----+---------+---+

Because is col() creates the column expression without checking there's two interesting side effects of this.

It can be re-used as it's not df specific
It can be used before the df is assigned

age = col('dob') / 365
if_expr = when(age < 18, 'underage').otherwise('adult')

df1 = df.read.csv(path).withColumn('age_category', if_expr)

df2 = df.read.parquet(path)\
    .select('*', age.alias('age'), if_expr.alias('age_category'))

age generates Column<b'(dob / 365)'>
if_expr generates Column<b'CASE WHEN ((dob / 365) < 18) THEN underage ELSE adult END'>

186

answered Nov 18 '22 19:11

pault

Related questions
                            
                                Pandas error "Can only use .str accessor with string values"
                            
                                Wrapping column names in Python Pandas DataFrame or Jupyter Notebooks
                            
                                How to convert a series of one value to float only?
                            
                                Cannot convert string to float in pandas (ValueError)
                            
                                find duplicate rows in a pandas dataframe
                            
                                how to convert a data.frame to tree structure object such as dendrogram
                            
                                Find discrepancies between two tables
                            
                                Use of loc to update a dataframe python pandas
                            
                                pandas read_table usecols error with ":"
                            
                                Pandas: create dataframe without auto ordering column names alphabetically
                            
                                how to get the name of column with maximum value in pyspark dataframe
                            
                                Swapping rows within the same pandas dataframe
                            
                                How do I collect a single column in Spark?
                            
                                find row positions and column names of cells contanining inf in pandas dataframe
                            
                                Loading multiple csv files of a folder into one dataframe
                            
                                Closest value to a specific column in R
                            
                                Pandas Adding Row with All Values Zero
                            
                                R merge two datasets based on specific columns with added condition
                            
                                Pandas: what is a NDFrame object (and what is a non-NDFrame object)
                            
                                How to keep original index of a DataFrame after groupby 2 columns?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

PySpark DataFrame Column Reference: df.col vs. df['col'] vs. F.col('col')?

Tags:

reference

dataframe

pyspark

Zilong Z

People also ask

1 Answers

1. `df.col`

2. `df["col"]`

3. `pyspark.sql.functions.col`

pault

Recent Activity

Donate For Us

PySpark DataFrame Column Reference: df.col vs. df['col'] vs. F.col('col')?

Tags:

reference

dataframe

pyspark

Zilong Z

People also ask

1 Answers

1. df.col

2. df["col"]

3. pyspark.sql.functions.col

pault

Related questions

Recent Activity

Donate For Us

1. `df.col`

2. `df["col"]`

3. `pyspark.sql.functions.col`