Attach column names to elements with Spark and Scala using FlatMap

For a given table like

+--+--+
| A| B|
+--+--+
|aa|bb|
|cc|dd|
+--+--+

I want to get a dataframe like:

+---+---+
|._1|._2|
+---+---+
|aa | A |
|bb | B |
|cc | A |
|dd | B |
+---+---+

using Apache Spark and Scala. Essentially, I want tuples that hold the original value at index 0 and its column name at index 1. This should work for an arbitrary schema: the number of columns is not known beforehand, so as far as I understand I cannot cast to a typed Dataset. This is how I tried to solve it:

val df = spark.read
  .option("header", "true")
  .option("sep", ";")
  .csv(path + "/tpch_nation.csv")

val cells = df.flatMap(tuple => {
  tuple.toSeq.asInstanceOf[Seq[String]].zip(df.columns.toList)
})
cells.show()

However, this produces a java.lang.NullPointerException inside the flatMap function. I am puzzled as to which object is null here and how I can solve the problem.

asked Nov 06 '25 by ElectricWizard
1 Answer

Don't reference df inside the closure. The closure is serialized and shipped to the executors, but the DataFrame's internal state (for example its SparkSession) is transient and is null there, so calling df.columns on an executor throws the NullPointerException. Extract the column names on the driver and capture only those:

import spark.implicits._ // encoder for the (String, String) tuples

val columns = df.columns // evaluated once, on the driver

val cells = df.flatMap(row => {
  row.toSeq.map(_.toString).zip(columns)
})
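
With the sample table from the question, cells.show() should then print something like:

+---+---+
| _1| _2|
+---+---+
| aa|  A|
| bb|  B|
| cc|  A|
| dd|  B|
+---+---+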

or don't use a driver-side list at all and take the field names from each row's own schema:

val cells = df.flatMap(row => {
  // each Row carries its schema, so nothing from the driver is captured
  row.toSeq.map(_.toString).zip(row.schema.fieldNames)
})
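
For completeness, a minimal self-contained sketch of the first variant; the object and app names are placeholders, and the two-column sample table is built inline instead of being read from the CSV in the question:

import org.apache.spark.sql.SparkSession

object CellsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cells-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // encoders for toDF and the tuple result

    // stand-in for the tpch_nation.csv data from the question
    val df = Seq(("aa", "bb"), ("cc", "dd")).toDF("A", "B")

    // materialized on the driver; only this Array is captured by the
    // closure below, not the DataFrame itself
    val columns = df.columns

    val cells = df.flatMap(row => row.toSeq.map(_.toString).zip(columns))
    cells.show()

    spark.stop()
  }
}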

Also:

  • Transpose column to row with Spark
  • unpivot in spark-sql/pyspark
  • Spark DataFrame column names not passed to slave nodes?
answered Nov 09 '25 by Alper t. Turker

