I need to turn a two-column DataFrame into a list grouped by one of the columns. I have done it successfully in pandas:
expertsDF = expertsDF.groupby('session', as_index=False).agg(lambda x: x.tolist())
But now I am trying to do the same thing in PySpark as follows:
expertsDF = df.groupBy('session').agg(lambda x: x.collect())
and I am getting the error:
all exprs should be Column
I have tried several commands but I simply cannot get it right, and the Spark documentation does not seem to cover this case.
An example input for it would be a dataframe:
session name
1 a
1 b
2 v
2 c
output:
session name
1 [a, b....]
2 [v, c....]
You can use the pyspark.sql.functions.collect_list(col) function. agg() expects Column expressions, not Python lambdas, which is why your call raised the error above:
from pyspark.sql.functions import collect_list

df.groupBy('session').agg(collect_list('name'))
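For reference, here is a minimal end-to-end sketch, assuming a local SparkSession named spark (row order in the output may vary, and element order inside the collected arrays is not guaranteed after a shuffle):

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "v"), (2, "c")],
                           ["session", "name"])

# collect_list gathers the values of each group into an array column;
# alias() renames the result from collect_list(name) back to name
df.groupBy("session").agg(collect_list("name").alias("name")).show()
# +-------+------+
# |session|  name|
# +-------+------+
# |      1|[a, b]|
# |      2|[v, c]|
# +-------+------+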
You could also use reduceByKey() to do this efficiently:
(df.rdd
 .map(lambda x: (x[0], [x[1]]))    # (session, name) -> (session, [name])
 .reduceByKey(lambda x, y: x + y)  # concatenate the lists per key
 .toDF(["session", "name"])
 .show())
+-------+------+
|session| name|
+-------+------+
| 1|[a, b]|
| 2|[v, c]|
+-------+------+
Data:
df = sc.parallelize([(1, "a"),
                     (1, "b"),
                     (2, "v"),
                     (2, "c")]).toDF(["session", "name"])