Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there any way to get the output of Spark's Dataset.show() method as a string?

The Spark Dataset.show() method is useful for seeing the contents of a dataset, particularly for debugging (it prints out a nicely-formatted table). As far as I can tell, it only prints to the console, but it would be useful to be able to get this as a string. For example, it would be nice to be able to write it to a log, or see it as the result of an expression when debugging with, say, IntelliJ.

Is there any way to get the output of Dataset.show() as a string?

like image 918
Jason Evans Avatar asked Aug 17 '17 16:08

Jason Evans


People also ask

What does show () do in PySpark?

Spark show() – Display DataFrame Contents in Table. Spark DataFrame show() is used to display the contents of the DataFrame in a Table Row & Column Format. By default, it shows only 20 Rows and the column values are truncated at 20 characters.

What does RDD collect () return?

collect. Return a list that contains all of the elements in this RDD. This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

Which method can be used to convert a Spark Dataset to a DataFrame?

Methods for creating Spark DataFrame 1. Create a list and parse it as a DataFrame using the toDataFrame() method from the SparkSession . 2. Convert an RDD to a DataFrame using the toDF() method.

What is DF collect ()?

collect() action function is used to retrieve all elements from the dataset (RDD/DataFrame/Dataset) as a Array[Row] to the driver program.


1 Answers

The corresponding method behind show isn't visible from outside the sql package. I've taken the corresponding method and changed it such that a dataframe can be passed as parameter (code taken from Dataset.scala) :

  def showString(df:DataFrame,_numRows: Int = 20, truncate: Int = 20): String = {
    val numRows = _numRows.max(0)
    val takeResult = df.take(numRows + 1)
    val hasMoreData = takeResult.length > numRows
    val data = takeResult.take(numRows)

    // For array values, replace Seq and Array with square brackets
    // For cells that are beyond `truncate` characters, replace it with the
    // first `truncate-3` and "..."
    val rows: Seq[Seq[String]] = df.schema.fieldNames.toSeq +: data.map { row =>
      row.toSeq.map { cell =>
        val str = cell match {
          case null => "null"
          case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
          case array: Array[_] => array.mkString("[", ", ", "]")
          case seq: Seq[_] => seq.mkString("[", ", ", "]")
          case _ => cell.toString
        }
        if (truncate > 0 && str.length > truncate) {
          // do not show ellipses for strings shorter than 4 characters.
          if (truncate < 4) str.substring(0, truncate)
          else str.substring(0, truncate - 3) + "..."
        } else {
          str
        }
      }: Seq[String]
    }

    val sb = new StringBuilder
    val numCols = df.schema.fieldNames.length

    // Initialise the width of each column to a minimum value of '3'
    val colWidths = Array.fill(numCols)(3)

    // Compute the width of each column
    for (row <- rows) {
      for ((cell, i) <- row.zipWithIndex) {
        colWidths(i) = math.max(colWidths(i), cell.length)
      }
    }

    // Create SeparateLine
    val sep: String = colWidths.map("-" * _).addString(sb, "+", "+", "+\n").toString()

    // column names
    rows.head.zipWithIndex.map { case (cell, i) =>
      if (truncate > 0) {
        StringUtils.leftPad(cell, colWidths(i))
      } else {
        StringUtils.rightPad(cell, colWidths(i))
      }
    }.addString(sb, "|", "|", "|\n")

    sb.append(sep)

    // data
    rows.tail.map {
      _.zipWithIndex.map { case (cell, i) =>
        if (truncate > 0) {
          StringUtils.leftPad(cell.toString, colWidths(i))
        } else {
          StringUtils.rightPad(cell.toString, colWidths(i))
        }
      }.addString(sb, "|", "|", "|\n")
    }

    sb.append(sep)

    // For Data that has more than "numRows" records
    if (hasMoreData) {
      val rowsString = if (numRows == 1) "row" else "rows"
      sb.append(s"only showing top $numRows $rowsString\n")
    }

    sb.toString()
  }
like image 78
Raphael Roth Avatar answered Oct 26 '22 23:10

Raphael Roth