Below is the spark scala code which will print one column DataSet[Row]:
import org.apache.spark.sql.{Dataset, Row, SparkSession}
val spark: SparkSession = SparkSession.builder()
.appName("Spark DataValidation")
.config("SPARK_MAJOR_VERSION", "2").enableHiveSupport()
.getOrCreate()
val kafkaPath:String="hdfs:///landing/APPLICATION/*"
val targetPath:String="hdfs://datacompare/3"
val pk:String = "APPLICATION_ID"
val pkValues = spark
.read
.json(kafkaPath)
.select("message.data.*")
.select(pk)
.distinct()
pkValues.show()
Output of about code :
+--------------+
|APPLICATION_ID|
+--------------+
| 388|
| 447|
| 346|
| 861|
| 361|
| 557|
| 482|
| 518|
| 432|
| 422|
| 533|
| 733|
| 472|
| 457|
| 387|
| 394|
| 786|
| 458|
+--------------+
Question :
How to convert this dataframe to comma separated String variable ?
Expected output :
val data:String= "388,447,346,861,361,557,482,518,432,422,533,733,472,457,387,394,786,458"
Please suggest how to convert DataFrame[Row] or Dataset to one String .
I don't think that's a good idea, since a dataFrame is a distributed object and can be inmense. Collect
will bring all the data to the driver, so you should perform this kind operation carefully.
Here is what you can do with a dataFrame (two options):
df.select("APPLICATION_ID").rdd.map(r => r(0)).collect.mkString(",")
df.select("APPLICATION_ID").collect.mkString(",")
Result with a test dataFrame with only 3 rows:
String = 388,447,346
Edit: With DataSet you can do directly:
ds.collect.mkString(",")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With