Is it possible to subclass DataFrame in Pyspark?

The documentation for PySpark shows DataFrames being constructed from sqlContext, sqlContext.read(), and a variety of other methods.

(See https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html)

Is it possible to subclass DataFrame and instantiate it independently? I would like to add methods and functionality to the base DataFrame class.

Asked Jan 11 '17 by jerzy

People also ask

How do you change data types in PySpark?

In PySpark, you can cast or change a DataFrame column's data type using the cast() function of the Column class. In this article, I will use withColumn(), selectExpr(), and SQL expressions to cast from String to Int (IntegerType), String to Boolean, etc., with PySpark examples.
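
For illustration, a minimal sketch of both approaches (this assumes a SparkSession named `spark` is already available; the column names are made up for the example):

    from pyspark.sql.functions import col

    df = spark.createDataFrame([("1", "true"), ("2", "false")], ["age", "is_active"])

    # cast() on a Column, applied through withColumn()
    df = df.withColumn("age", col("age").cast("int"))

    # the same kind of cast expressed as a SQL expression via selectExpr()
    df = df.selectExpr("age", "CAST(is_active AS BOOLEAN) AS is_active")

    df.printSchema()
    # root
    #  |-- age: integer (nullable = true)
    #  |-- is_active: boolean (nullable = true)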

Are PySpark Dataframes immutable?

While PySpark derives its basic data types from Python, its own data structures are limited to RDDs, DataFrames, and GraphFrames. These DataFrames are immutable and offer reduced flexibility for row- and column-level manipulation compared to Python.
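
A short illustration of that immutability (again assuming a SparkSession named `spark`): transformations never modify a DataFrame in place, they return a new one.

    df = spark.createDataFrame([("a", 1)], ["foo", "bar"])

    # withColumn() returns a new DataFrame; df itself is untouched
    df2 = df.withColumn("bar", df["bar"] + 1)

    df.collect()    # [Row(foo='a', bar=1)]  <- original unchanged
    df2.collect()   # [Row(foo='a', bar=2)]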

Which is faster pandas or PySpark?

Due to parallel execution across all cores on multiple machines, PySpark runs operations faster than pandas, so we often need to convert a pandas DataFrame to a PySpark (Spark with Python) DataFrame for better performance. This is one of the major differences between pandas and PySpark DataFrames.
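
The conversion in both directions is a one-liner; a hedged sketch (assuming a SparkSession named `spark`, and that the data fits on the driver when collecting back):

    import pandas as pd

    # pandas -> Spark: the data gets distributed across the cluster
    pdf = pd.DataFrame({"foo": ["a", "b"], "bar": [1, 2]})
    sdf = spark.createDataFrame(pdf)

    # Spark -> pandas: collects everything to the driver,
    # so only do this when the result is small enough to fit in memory
    pdf_back = sdf.toPandas()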


1 Answer

It really depends on your goals.

  • Technically speaking it is possible. pyspark.sql.DataFrame is just a plain Python class. You can extend it or monkey-patch it if you need to (a monkey-patching sketch is included after this list).

    from pyspark.sql import DataFrame

    class DataFrameWithZipWithIndex(DataFrame):
        def __init__(self, df):
            # wrap an existing DataFrame, reusing its JVM handle and SQL context
            super(self.__class__, self).__init__(df._jdf, df.sql_ctx)

        def zipWithIndex(self):
            # go through the underlying RDD to attach a row index,
            # then rebuild a DataFrame with an extra _idx column
            return (self.rdd
                .zipWithIndex()
                .map(lambda row: (row[1], ) + row[0])
                .toDF(["_idx"] + self.columns))

    Example usage:

    df = sc.parallelize([("a", 1)]).toDF(["foo", "bar"])
    
    with_zipwithindex = DataFrameWithZipWithIndex(df)
    
    isinstance(with_zipwithindex, DataFrame)
    
    True
    
    with_zipwithindex.zipWithIndex().show()
    
    +----+---+---+
    |_idx|foo|bar|
    +----+---+---+
    |   0|  a|  1|
    +----+---+---+
    
  • Practically speaking you won't be able to do much here. DataFrame is a thin wrapper around a JVM object and doesn't do much beyond providing docstrings, converting arguments to the natively required form, calling JVM methods, and wrapping the results in Python adapters when necessary.

    With plain Python code you won't even be able to get near DataFrame / Dataset internals or modify their core behavior. If you're looking for a standalone, Python-only Spark DataFrame implementation, it is not possible.
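
As mentioned in the first bullet, the same zipWithIndex logic can also be monkey-patched onto the stock class instead of subclassing it. A minimal sketch (assuming the same SparkContext `sc` and an active SQLContext / SparkSession as above):

    from pyspark.sql import DataFrame

    def zip_with_index(self):
        # same logic as in the subclass above: attach a row index via the RDD
        return (self.rdd
            .zipWithIndex()
            .map(lambda row: (row[1], ) + row[0])
            .toDF(["_idx"] + self.columns))

    # attach the method to the existing DataFrame class
    DataFrame.zipWithIndex = zip_with_index

    df = sc.parallelize([("a", 1)]).toDF(["foo", "bar"])
    df.zipWithIndex().show()
    # +----+---+---+
    # |_idx|foo|bar|
    # +----+---+---+
    # |   0|  a|  1|
    # +----+---+---+

The trade-off is that monkey-patching affects every DataFrame in the session, while the subclass only wraps the instances you explicitly construct.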

Answered Oct 14 '22 by zero323