Why is there no map function for DataFrame in PySpark while the Scala equivalent has it?

I am currently working with PySpark. There is no map function on DataFrame, and one has to drop down to the RDD for a map function. In Scala there is a map on DataFrame. Is there any reason for this?

asked Nov 17 '17 by Raghavan


People also ask

Can we use map on DataFrame in Spark?

Spark's map() is a transformation operation that applies a function to every element of an RDD, DataFrame, or Dataset and returns a new RDD or Dataset respectively. A short RDD-based DataFrame example follows below.
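A minimal PySpark sketch of the usual workaround, dropping from the DataFrame to its underlying RDD of Rows and back (column names and values here are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "label"])

    # DataFrame has no map() in PySpark; drop to the underlying RDD of Rows,
    # transform each Row, then rebuild a DataFrame.
    result = (df.rdd
                .map(lambda row: (row.id * 2, row.label.upper()))
                .toDF(["id", "label"]))
    result.show()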

How do you convert a DataFrame to a map in PySpark?

Solution: the PySpark SQL function create_map() is used to convert selected DataFrame columns to MapType. create_map() takes alternating key and value columns as arguments and returns a MapType column.
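A short sketch of create_map() (the key names, columns, and data are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, create_map, lit

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Alice", "NY"), ("Bob", "LA")], ["name", "city"])

    # create_map() takes alternating key and value columns and
    # returns a single MapType column.
    result = df.select(
        create_map(
            lit("name"), col("name"),
            lit("city"), col("city"),
        ).alias("props")
    )
    result.show(truncate=False)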

How do you make a PySpark map?

Create a PySpark MapType by using the MapType() constructor to create a map object. MapType key points: the first parameter, keyType, specifies the type of the keys in the map; the second parameter, valueType, specifies the type of the values in the map.
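For instance, a minimal schema sketch (field names and data are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (IntegerType, MapType, StringType,
                                   StructField, StructType)

    spark = SparkSession.builder.getOrCreate()

    # MapType(keyType, valueType): string keys, integer values here.
    schema = StructType([
        StructField("name", StringType()),
        StructField("scores", MapType(StringType(), IntegerType())),
    ])

    df = spark.createDataFrame([("Alice", {"math": 90, "physics": 80})], schema)
    df.printSchema()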

What is the function of the map () in Spark?

A map is a transformation operation in Apache Spark. It applies a function to each element of an RDD and returns the result as a new RDD. In the map operation, the developer can define custom business logic; the same logic is applied to every element of the RDD.
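As a quick illustration (the values are arbitrary):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # map() applies the function to every element and returns a new RDD.
    squares = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x)
    print(squares.collect())  # [1, 4, 9, 16]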


1 Answer

Dataset.map is not part of the DataFrame (Dataset[Row]) API. It transforms a strongly typed Dataset[T] into a strongly typed Dataset[U]:

def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U] 

and there is simply no place for Python in the strongly typed Dataset world. In general, Datasets are native JVM objects (unlike RDDs, they have no Python-specific implementation) that depend heavily on the rich Scala type system (even the Java API is severely limited). Even if Python implemented some variant of the Encoder API, data would still have to be converted to an RDD for computation.

In contrast, Python implements its own map-like mechanism with vectorized UDFs, released in Spark 2.3. They are focused on a high-performance serde implementation coupled with the Pandas API.
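A minimal sketch of a SCALAR pandas UDF, using the type-hint style available in Spark 3.x (the column name and function are illustrative):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])

    # A SCALAR pandas UDF receives a whole pandas Series per batch
    # instead of one Python object per row.
    @pandas_udf("double")
    def times_two(s: pd.Series) -> pd.Series:
        return s * 2

    df.select(times_two("x").alias("x2")).show()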

That includes both typical UDFs (in particular the SCALAR and SCALAR_ITER variants) as well as map-like variants: GROUPED_MAP and MAP_ITER, applied through GroupedData.apply and DataFrame.mapInPandas (Spark >= 3.0.0) respectively.
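A sketch of both map-like variants, assuming Spark >= 3.0.0 where applyInPandas supersedes GroupedData.apply (schemas and data are illustrative):

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"])

    # MAP_ITER style: the function gets an iterator of pandas DataFrames
    # (one per batch) and yields transformed batches.
    def add_one(batches):
        for pdf in batches:
            pdf["v"] = pdf["v"] + 1
            yield pdf

    df.mapInPandas(add_one, schema="id long, v double").show()

    # GROUPED_MAP style: the function gets one pandas DataFrame per group.
    def center(pdf: pd.DataFrame) -> pd.DataFrame:
        return pdf.assign(v=pdf["v"] - pdf["v"].mean())

    df.groupBy("id").applyInPandas(center, schema="id long, v double").show()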

answered Oct 08 '22 by zero323