Concatenating string by rows in pyspark

I have a PySpark DataFrame like this:

DOCTOR | PATIENT
JOHN   | SAM
JOHN   | PETER
JOHN   | ROBIN
BEN    | ROSE
BEN    | GRAY

I need to concatenate the patient names across rows, grouped by doctor, so that the output looks like this:

DOCTOR | PATIENT
JOHN   | SAM, PETER, ROBIN
BEN    | ROSE, GRAY

Can anybody help me create this DataFrame in PySpark?

Thanks in advance.

How do I concatenate multiple rows in PySpark?

The concat() function in PySpark SQL concatenates multiple DataFrame columns into a single column. It works on string, binary, and compatible array columns. Note that concat() combines columns within each row; it does not combine values across rows.

How do I concatenate columns in PySpark?

concat() joins two or more columns of a PySpark DataFrame into a single new column. Wrapping the call in select() returns the concatenated column, and alias() gives that column a name.

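The simplest way I can think of is to use collect_list to gather each group's values into an array, then join the array with concat_ws:

import pyspark.sql.functions as f
# Replace "col1" with the grouping column (DOCTOR) and col2 with the column to join (PATIENT).
df.groupby("col1").agg(f.concat_ws(", ", f.collect_list(df.col2)))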

import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Sample rows: (user_id, category)
data = [
  ("U_104", "food"),
  ("U_103", "cosmetics"),
  ("U_103", "children"),
  ("U_104", "groceries"),
  ("U_103", "food")
]
# Both columns are nullable strings
schema = StructType([
  StructField("user_id", StringType(), True),
  StructField("category", StringType(), True),
])
# getOrCreate() also sets up the underlying SparkContext, so no separate SparkContext call is needed
spark = SparkSession.builder.appName("groupby").getOrCreate()
df = spark.createDataFrame(data, schema)
# Collect each user's categories into a list, then join the list with commas
group_df = df.groupBy(f.col("user_id")).agg(
  f.concat_ws(",", f.collect_list(f.col("category"))).alias("categories")
)
# show() truncates strings longer than 20 characters by default
group_df.show()

+-------+--------------------+
|user_id|          categories|
+-------+--------------------+
|  U_104|      food,groceries|
|  U_103|cosmetics,childre...|
+-------+--------------------+

There are some useful aggregation examples

pyspark dataframe aggregation examples

Tags: python, apache-spark, pyspark

Prerit Saxena

2 Answers

Assaf Mendelson

DevShepherd
