I need to turn a two-column DataFrame into a list grouped by one of the columns. I have done it successfully in pandas:
expertsDF = expertsDF.groupby('session', as_index=False).agg(lambda x: x.tolist())
But now I am trying to do the same thing in PySpark as follows:
expertsDF = df.groupBy('session').agg(lambda x: x.collect())
and I am getting the error:
all exprs should be Column
I have tried several commands but I simply cannot get it right, and the Spark documentation does not seem to cover this case.
An example input for it would be a dataframe:
session name
1 a
1 b
2 v
2 c
output:
session name
1 [a, b....]
2 [v, c....]
You can use the pyspark.sql.functions.collect_list(col) function. agg() expects Column expressions, not Python lambdas, which is why your call raised the error above:
from pyspark.sql.functions import collect_list

df.groupBy('session').agg(collect_list('name'))
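For reference, here is a minimal end-to-end sketch, assuming a local SparkSession named spark (row order in the output may vary, and element order inside the collected arrays is not guaranteed after a shuffle):

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "v"), (2, "c")],
                           ["session", "name"])

# collect_list gathers the values of each group into an array column;
# alias() renames the result from collect_list(name) back to name
df.groupBy("session").agg(collect_list("name").alias("name")).show()
# +-------+------+
# |session|  name|
# +-------+------+
# |      1|[a, b]|
# |      2|[v, c]|
# +-------+------+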
You could also use reduceByKey() to do this efficiently:
(df.rdd
 .map(lambda x: (x[0], [x[1]]))    # (session, name) -> (session, [name])
 .reduceByKey(lambda x, y: x + y)  # concatenate the lists per key
 .toDF(["session", "name"])
 .show())
+-------+------+
|session| name|
+-------+------+
| 1|[a, b]|
| 2|[v, c]|
+-------+------+
Data:
df = sc.parallelize([(1, "a"),
                     (1, "b"),
                     (2, "v"),
                     (2, "c")]).toDF(["session", "name"])