
PySpark transform method that's equivalent to the Scala Dataset#transform method

The Spark Scala API has a Dataset#transform method that makes it easy to chain custom DataFrame transformations like so:

val weirdDf = df
  .transform(myFirstCustomTransformation)
  .transform(anotherCustomTransformation)

I don't see an equivalent transform method for pyspark in the documentation.

Is there a PySpark way to chain custom transformations?

If not, how can the pyspark.sql.DataFrame class be monkey patched to add a transform method?

Update

The transform method was added to the PySpark DataFrame API in PySpark 3.0.
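For reference, on Spark 3.0+ the built-in DataFrame.transform can be used directly; here is a minimal sketch, where with_doubled_id is a hypothetical custom transformation used only for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def with_doubled_id(df):
    # hypothetical custom transformation: add a column with id * 2
    return df.withColumn("doubled_id", F.col("id") * 2)

# transform passes the DataFrame to the function and returns its result
spark.range(3).transform(with_doubled_id).show()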

asked Sep 15 '17 by Powers


People also ask

What is StringIndexer in PySpark?

A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0.
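For context, a minimal sketch of typical StringIndexer usage; the toy data and column names are illustrative only:

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()

# "a" is the most frequent label, so it receives index 0.0
df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"],
)

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexer.fit(df).transform(df).show()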

What is transform in PySpark?

PySpark RDD transformations are lazily evaluated and are used to transform one RDD into another. When executed on an RDD, they produce one or more new RDDs.
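A small sketch of that laziness; the numbers are arbitrary:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4])

# map and filter are transformations: nothing runs yet, they only
# describe new RDDs derived from the original one
doubled = rdd.map(lambda x: x * 2)
large = doubled.filter(lambda x: x > 4)

# collect is an action, which triggers the actual evaluation
print(large.collect())  # [6, 8]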

Can we use a Dataset in PySpark?

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine.


1 Answer

Implementation:

from pyspark.sql.dataframe import DataFrame

def transform(self, f):
    # Call f with this DataFrame and return whatever DataFrame f produces,
    # which is what allows custom transformations to be chained
    return f(self)

# Monkey patch: attach the method so every DataFrame instance gets .transform
DataFrame.transform = transform

Usage:

spark.range(1).transform(lambda df: df.selectExpr("id * 2"))
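With the monkey-patched method (or the built-in one on Spark 3.0+), chaining then reads like the Scala snippet from the question; the two transformation bodies below are hypothetical examples:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def my_first_custom_transformation(df):
    # hypothetical: add a doubled id column
    return df.withColumn("doubled", F.col("id") * 2)

def another_custom_transformation(df):
    # hypothetical: add a constant label column
    return df.withColumn("label", F.lit("weird"))

weird_df = (
    spark.range(3)
    .transform(my_first_custom_transformation)
    .transform(another_custom_transformation)
)
weird_df.show()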
answered Nov 15 '22 by Alper t. Turker