
Turning a pandas expression into a PySpark expression

I need to turn a two-column DataFrame into lists grouped by one of the columns. I have done it successfully in pandas:

expertsDF = expertsDF.groupby('session', as_index=False).agg(lambda x: x.tolist())

But now I am trying to do the same thing in PySpark as follows:

expertsDF = df.groupBy('session').agg(lambda x: x.collect())

and I am getting the error:

all exprs should be Column

I have tried several commands, but I simply cannot get it right, and the Spark documentation does not contain anything similar.

An example input would be the DataFrame:

session     name
1           a
1           b
2           v
2           c

output:

session    name
1          [a, b....]
2          [v, c....] 
asked Oct 22 '16 by Kratos


People also ask

How do I convert pandas to PySpark?

Spark provides a createDataFrame(pandas_dataframe) method to convert a pandas DataFrame to a Spark DataFrame; by default, Spark infers the schema by mapping the pandas data types to PySpark data types. If you want all columns as strings, use spark.createDataFrame(pandasDF.astype(str)).
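
A minimal sketch of that conversion, assuming a local SparkSession and a small pandas frame shaped like the question's data:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas DataFrame shaped like the question's example
pdf = pd.DataFrame({"session": [1, 1, 2, 2], "name": ["a", "b", "v", "c"]})

# Spark infers the schema from the pandas dtypes
sdf = spark.createDataFrame(pdf)

# or force every column to string
sdf_str = spark.createDataFrame(pdf.astype(str))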

Can pandas be used in PySpark?

pandas-on-Spark DataFrame and Spark DataFrame are virtually interchangeable. However, note that a new default index is created when pandas-on-Spark DataFrame is created from Spark DataFrame. See Default Index Type. In order to avoid this overhead, specify the column to use as an index when possible.
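
As a rough illustration (pandas-on-Spark ships as pyspark.pandas from Spark 3.2 onwards; the column names below are just the question's example):

import pyspark.pandas as ps

# a pandas-on-Spark DataFrame, backed by Spark under the hood
psdf = ps.DataFrame({"session": [1, 1, 2, 2], "name": ["a", "b", "v", "c"]})

# convert to a plain Spark DataFrame and back; passing index_col
# avoids the default-index overhead mentioned above
sdf = psdf.to_spark(index_col="idx")
psdf_again = sdf.pandas_api(index_col="idx")  # called to_pandas_on_spark in older 3.x releases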

How do I change from pandas to Spark?

Import the pandas library and create a pandas DataFrame using the DataFrame() constructor. Create a Spark session by importing SparkSession from the pyspark library. Enable Apache Arrow using the conf property. Pass the pandas DataFrame to the createDataFrame() method of the SparkSession object.
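
Sketched end to end, those steps look roughly like this (the Arrow config key shown is the Spark 3.x spelling; older releases used spark.sql.execution.arrow.enabled):

import pandas as pd
from pyspark.sql import SparkSession

# pandas DataFrame built with the DataFrame() constructor
pdf = pd.DataFrame({"session": [1, 1, 2, 2], "name": ["a", "b", "v", "c"]})

# Spark session
spark = SparkSession.builder.appName("pandas_to_spark").getOrCreate()

# enable Apache Arrow for faster pandas <-> Spark conversion
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# hand the pandas DataFrame to the SparkSession
sdf = spark.createDataFrame(pdf)
sdf.show()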

Can PySpark replace pandas?

Conclusion. Do not try to replace pandas with Spark; they are complementary to each other, and each has its pros and cons. Whether to use pandas or Spark depends on your use case. For most machine learning tasks, you will probably end up using pandas, even if you do your preprocessing with Spark.


2 Answers

You can also use the pyspark.sql.functions.collect_list(col) function:

from pyspark.sql.functions import collect_list

df.groupBy('session').agg(collect_list('name'))
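
By default the aggregated column comes back with a name like collect_list(name), so a slight variation aliases it back to the original column name (note that element order inside each list is not guaranteed after the shuffle):

from pyspark.sql.functions import collect_list

result = df.groupBy('session').agg(collect_list('name').alias('name'))
result.show()
# typical output for the sample data:
# +-------+------+
# |session|  name|
# +-------+------+
# |      1|[a, b]|
# |      2|[v, c]|
# +-------+------+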
answered Sep 26 '22 by MaxU - stop WAR against UA


You could use reduceByKey() to do this efficiently:

(df.rdd
 .map(lambda x: (x[0], [x[1]]))     # (session, [name]) pairs
 .reduceByKey(lambda x, y: x + y)   # concatenate the lists per session
 .toDF(["session", "name"]).show())
+-------+------+
|session|  name|
+-------+------+
|      1|[a, b]|
|      2|[v, c]|
+-------+------+

Data:

df = sc.parallelize([(1, "a"),
                     (1, "b"),
                     (2, "v"),
                     (2, "c")]).toDF(["session", "name"])
answered Sep 25 '22 by mtoto