I have a PySpark DataFrame like this:
DOCTOR | PATIENT
JOHN | SAM
JOHN | PETER
JOHN | ROBIN
BEN | ROSE
BEN | GRAY
and I need to concatenate the patient names across rows, so that the output looks like:
DOCTOR | PATIENT
JOHN | SAM, PETER, ROBIN
BEN | ROSE, GRAY
Can anybody help me create this DataFrame in PySpark?
Thanks in advance.
PySpark Concatenate Using concat()
The concat() function in PySpark SQL concatenates multiple DataFrame columns into a single column. It works on string, binary, and compatible array columns.
concat() joins two or more columns of the given DataFrame, row by row, and puts the result in a new column. Use select() to view the concatenated column and alias() to name it.
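The simplest way I can think of is to use collect_list and join the collected values with concat_ws (here col1 is the grouping column and col2 is the column to concatenate):
import pyspark.sql.functions as f
df.groupby("col1").agg(f.concat_ws(", ", f.collect_list(df.col2)))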
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Sample data: one row per (user, category) pair.
data = [
    ("U_104", "food"),
    ("U_103", "cosmetics"),
    ("U_103", "children"),
    ("U_104", "groceries"),
    ("U_103", "food")
]
schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("category", StringType(), True),
])

# The session creates its own SparkContext, so no separate one is needed.
spark = SparkSession.builder.appName("groupby").getOrCreate()
df = spark.createDataFrame(data, schema)

# Collect each user's categories into a list, then join them with commas.
group_df = df.groupBy(f.col("user_id")).agg(
    f.concat_ws(",", f.collect_list(f.col("category"))).alias("categories")
)
group_df.show()
+-------+--------------------+
|user_id|          categories|
+-------+--------------------+
|  U_104|      food,groceries|
|  U_103|cosmetics,childre...|
+-------+--------------------+
There are some useful aggregation examples