Is there any way to get the output of Spark's Dataset.show() method as a string?

Tags: apache-spark, apache-spark-sql

The Spark Dataset.show() method is useful for seeing the contents of a dataset, particularly for debugging (it prints out a nicely-formatted table). As far as I can tell, it only prints to the console, but it would be useful to be able to get this as a string. For example, it would be nice to be able to write it to a log, or see it as the result of an expression when debugging with, say, IntelliJ.

Is there any way to get the output of Dataset.show() as a string?

918

asked Aug 17 '17 16:08

Jason Evans

1 Answer

The method behind show (showString) isn't visible from outside the sql package. I've copied it and changed it so that a DataFrame can be passed as a parameter (code taken from Dataset.scala):

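  // Imports needed by the method below
  import org.apache.commons.lang3.StringUtils
  import org.apache.spark.sql.DataFrame

  def showString(df: DataFrame, _numRows: Int = 20, truncate: Int = 20): String = {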
    val numRows = _numRows.max(0)
    val takeResult = df.take(numRows + 1)
    val hasMoreData = takeResult.length > numRows
    val data = takeResult.take(numRows)

    // For array values, replace Seq and Array with square brackets
    // For cells that are beyond `truncate` characters, replace it with the
    // first `truncate-3` and "..."
    val rows: Seq[Seq[String]] = df.schema.fieldNames.toSeq +: data.map { row =>
      row.toSeq.map { cell =>
        val str = cell match {
          case null => "null"
          case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
          case array: Array[_] => array.mkString("[", ", ", "]")
          case seq: Seq[_] => seq.mkString("[", ", ", "]")
          case _ => cell.toString
        }
        if (truncate > 0 && str.length > truncate) {
          // do not show ellipses for strings shorter than 4 characters.
          if (truncate < 4) str.substring(0, truncate)
          else str.substring(0, truncate - 3) + "..."
        } else {
          str
        }
      }: Seq[String]
    }

    val sb = new StringBuilder
    val numCols = df.schema.fieldNames.length

    // Initialise the width of each column to a minimum value of '3'
    val colWidths = Array.fill(numCols)(3)

    // Compute the width of each column
    for (row <- rows) {
      for ((cell, i) <- row.zipWithIndex) {
        colWidths(i) = math.max(colWidths(i), cell.length)
      }
    }

    // Create SeparateLine
    val sep: String = colWidths.map("-" * _).addString(sb, "+", "+", "+\n").toString()

    // column names
    rows.head.zipWithIndex.map { case (cell, i) =>
      if (truncate > 0) {
        StringUtils.leftPad(cell, colWidths(i))
      } else {
        StringUtils.rightPad(cell, colWidths(i))
      }
    }.addString(sb, "|", "|", "|\n")

    sb.append(sep)

    // data
    rows.tail.map {
      _.zipWithIndex.map { case (cell, i) =>
        if (truncate > 0) {
          StringUtils.leftPad(cell.toString, colWidths(i))
        } else {
          StringUtils.rightPad(cell.toString, colWidths(i))
        }
      }.addString(sb, "|", "|", "|\n")
    }

    sb.append(sep)

    // For Data that has more than "numRows" records
    if (hasMoreData) {
      val rowsString = if (numRows == 1) "row" else "rows"
      sb.append(s"only showing top $numRows $rowsString\n")
    }

    sb.toString()
  }
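
If copying the method is more than you need, another option is to capture whatever show() prints. The sketch below assumes show() writes through Scala's println (and therefore Console.out), which is how I understand Dataset.show to work; captureOutput is a hypothetical helper name, not part of Spark. The demo uses plain println calls so that it runs without a Spark session.

```scala
import java.io.{ByteArrayOutputStream, PrintStream}

object CaptureOutput {
  // Hypothetical helper (not part of Spark): runs `block` and returns
  // everything it printed through Scala's Console.out as a String.
  def captureOutput(block: => Unit): String = {
    val buffer = new ByteArrayOutputStream()
    // Explicit UTF-8 so that encoding and decoding agree on every platform.
    val stream = new PrintStream(buffer, true, "UTF-8")
    try {
      // Console.withOut redirects println for the duration of the block,
      // on the current thread only.
      Console.withOut(stream) {
        block
      }
    } finally {
      stream.flush()
    }
    buffer.toString("UTF-8")
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for a call such as df.show(); prints a small table.
    val text = captureOutput {
      println("+---+")
      println("| id|")
      println("+---+")
    }
    // `text` now holds the printed lines; here they are simply echoed.
    print(text)
  }
}
```

With a DataFrame df in scope, the corresponding call would be captureOutput(df.show()), under the same assumption about println. Since the redirection only applies to the current thread, output printed from other threads is not captured.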

answered Oct 26 '22 23:10

Raphael Roth
