Why is there no map function for DataFrame in PySpark while the Scala equivalent has it?

I am currently working with PySpark. There is no map function on DataFrame, and one has to drop down to the RDD for a map function. In Scala there is a map on DataFrame. Is there any reason for this?

asked Nov 17 '17 by Raghavan


People also ask

Can we use map on DataFrame in Spark?

Spark's map() is a transformation operation that applies a function to every element of an RDD, DataFrame, or Dataset and returns a new RDD or Dataset respectively. A short RDD-based DataFrame example follows below.
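A minimal PySpark sketch of the usual workaround, dropping from the DataFrame to its underlying RDD of Rows and back (column names and values here are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "label"])

    # DataFrame has no map() in PySpark; drop to the underlying RDD of Rows,
    # transform each Row, then rebuild a DataFrame.
    result = (df.rdd
                .map(lambda row: (row.id * 2, row.label.upper()))
                .toDF(["id", "label"]))
    result.show()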

How do you convert a DataFrame to a map in PySpark?

Solution: the PySpark SQL function create_map() is used to convert selected DataFrame columns to MapType. create_map() takes alternating key and value columns as arguments and returns a MapType column.
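A short sketch of create_map() (the key names, columns, and data are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, create_map, lit

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Alice", "NY"), ("Bob", "LA")], ["name", "city"])

    # create_map() takes alternating key and value columns and
    # returns a single MapType column.
    result = df.select(
        create_map(
            lit("name"), col("name"),
            lit("city"), col("city"),
        ).alias("props")
    )
    result.show(truncate=False)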

How do you make a PySpark map?

Create a PySpark MapType by using the MapType() constructor to create a map object. MapType key points: the first parameter, keyType, specifies the type of the keys in the map; the second parameter, valueType, specifies the type of the values in the map.
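For instance, a minimal schema sketch (field names and data are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (IntegerType, MapType, StringType,
                                   StructField, StructType)

    spark = SparkSession.builder.getOrCreate()

    # MapType(keyType, valueType): string keys, integer values here.
    schema = StructType([
        StructField("name", StringType()),
        StructField("scores", MapType(StringType(), IntegerType())),
    ])

    df = spark.createDataFrame([("Alice", {"math": 90, "physics": 80})], schema)
    df.printSchema()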

What is the function of the map () in Spark?

A map is a transformation operation in Apache Spark. It applies a function to each element of an RDD and returns the result as a new RDD. In the map operation, the developer can define custom business logic; the same logic is applied to every element of the RDD.
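As a quick illustration (the values are arbitrary):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # map() applies the function to every element and returns a new RDD.
    squares = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x)
    print(squares.collect())  # [1, 4, 9, 16]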


1 Answer

Dataset.map is not part of the DataFrame (Dataset[Row]) API. It transforms a strongly typed Dataset[T] into a strongly typed Dataset[U]:

def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U] 

and there is simply no place for Python in the strongly typed Dataset world. In general, Datasets are native JVM objects (unlike RDDs, they have no Python-specific implementation) that depend heavily on the rich Scala type system (even the Java API is severely limited). Even if Python implemented some variant of the Encoder API, data would still have to be converted to an RDD for computation.

In contrast, Python implements its own map-like mechanism with vectorized UDFs, released in Spark 2.3. They are focused on a high-performance serde implementation coupled with the Pandas API.
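A minimal sketch of a SCALAR pandas UDF, using the type-hint style available in Spark 3.x (the column name and function are illustrative):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])

    # A SCALAR pandas UDF receives a whole pandas Series per batch
    # instead of one Python object per row.
    @pandas_udf("double")
    def times_two(s: pd.Series) -> pd.Series:
        return s * 2

    df.select(times_two("x").alias("x2")).show()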

That includes both typical UDFs (in particular the SCALAR and SCALAR_ITER variants) as well as map-like variants: GROUPED_MAP and MAP_ITER, applied through GroupedData.apply and DataFrame.mapInPandas (Spark >= 3.0.0) respectively.
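A sketch of both map-like variants, assuming Spark >= 3.0.0 where applyInPandas supersedes GroupedData.apply (schemas and data are illustrative):

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"])

    # MAP_ITER style: the function gets an iterator of pandas DataFrames
    # (one per batch) and yields transformed batches.
    def add_one(batches):
        for pdf in batches:
            pdf["v"] = pdf["v"] + 1
            yield pdf

    df.mapInPandas(add_one, schema="id long, v double").show()

    # GROUPED_MAP style: the function gets one pandas DataFrame per group.
    def center(pdf: pd.DataFrame) -> pd.DataFrame:
        return pdf.assign(v=pdf["v"] - pdf["v"].mean())

    df.groupBy("id").applyInPandas(center, schema="id long, v double").show()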

answered Oct 08 '22 by zero323