The Spark Scala API has a Dataset#transform
method that makes it easy to chain custom DataFrame transformations like so:
val weirdDf = df
  .transform(myFirstCustomTransformation)
  .transform(anotherCustomTransformation)
I don't see an equivalent transform method for PySpark in the documentation.
Is there a PySpark way to chain custom transformations?
If not, how can the pyspark.sql.DataFrame class be monkey-patched to add a transform method?
Update
The transform method was added to pyspark.sql.DataFrame in PySpark 3.0.
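On PySpark 3.0+ you can therefore chain custom transformations directly, much like the Scala example above. A minimal sketch (the helper functions with_doubled_id and with_label are illustrative names, not part of the API):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative custom transformations: each takes a DataFrame and returns a DataFrame
def with_doubled_id(df):
    return df.withColumn("doubled_id", F.col("id") * 2)

def with_label(df):
    return df.withColumn("label", F.lit("example"))

# Chain them with the built-in DataFrame.transform (PySpark 3.0+)
result = (
    spark.range(3)
    .transform(with_doubled_id)
    .transform(with_label)
)
result.show()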
Implementation (monkey patch for older PySpark versions):
from pyspark.sql.dataframe import DataFrame

def transform(self, f):
    return f(self)

# Monkey patch DataFrame so custom transformations can be chained like in Scala
DataFrame.transform = transform
Usage:
spark.range(1).transform(lambda df: df.selectExpr("id * 2"))
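With the patch in place, custom transformations chain just like the Scala example in the question. A sketch, assuming the patch above and an existing SparkSession named spark; the function names are illustrative:

def my_first_custom_transformation(df):
    return df.selectExpr("id", "id * 2 as doubled")

def another_custom_transformation(df):
    return df.withColumn("tripled", df["id"] * 3)

weird_df = (
    spark.range(5)
    .transform(my_first_custom_transformation)
    .transform(another_custom_transformation)
)
weird_df.show()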