Spark Scala Split dataframe into equal number of rows

I have a DataFrame and wish to divide it into parts with an equal number of rows.

In other words, I want a list of DataFrames where each one is a disjoint subset of the original DataFrame.

Let's say the input DataFrame is the following:

  +------------------+-----------+-----+--------------------+
  |         eventName|original_dt|count|            features|
  +------------------+-----------+-----+--------------------+
  |15.509775004326936|          0|  100|[15.5097750043269...|
  |15.509775004326936|          0|  101|[15.5097750043269...|
  |15.509775004326936|          0|  102|[15.5097750043269...|
  |15.509775004326936|          0|  103|[15.5097750043269...|
  |15.509775004326936|          0|  104|[15.5097750043269...|
  |15.509775004326936|          0|  105|[15.5097750043269...|
  |15.509775004326936|          0|  106|[15.5097750043269...|
  |15.509775004326936|          0|  107|[15.5097750043269...|
  |15.509775004326936|          0|  108|[15.5097750043269...|
  |15.509775004326936|          0|  109|[15.5097750043269...|
  |15.509775004326936|          0|  110|[15.5097750043269...|
  |15.509775004326936|          0|  111|[15.5097750043269...|
  |15.509775004326936|          0|  112|[15.5097750043269...|
  |15.509775004326936|          0|  113|[15.5097750043269...|
  |15.509775004326936|          0|  114|[15.5097750043269...|
  |15.509775004326936|          0|  115|[15.5097750043269...|
  | 43.01955000865387|          0|  116|[43.0195500086538...|
  +------------------+-----------+-----+--------------------+

I wish to split it into k equal-sized DataFrames. If k = 4, then a possible result would be:

  +------------------+-----------+-----+--------------------+
  |         eventName|original_dt|count|            features|
  +------------------+-----------+-----+--------------------+
  |15.509775004326936|          0|  106|[15.5097750043269...|
  |15.509775004326936|          0|  107|[15.5097750043269...|
  |15.509775004326936|          0|  110|[15.5097750043269...|
  |15.509775004326936|          0|  111|[15.5097750043269...|
  +------------------+-----------+-----+--------------------+

  +------------------+-----------+-----+--------------------+
  |         eventName|original_dt|count|            features|
  +------------------+-----------+-----+--------------------+
  |15.509775004326936|          0|  104|[15.5097750043269...|
  |15.509775004326936|          0|  108|[15.5097750043269...|
  |15.509775004326936|          0|  112|[15.5097750043269...|
  |15.509775004326936|          0|  114|[15.5097750043269...|
  +------------------+-----------+-----+--------------------+


  +------------------+-----------+-----+--------------------+
  |         eventName|original_dt|count|            features|
  +------------------+-----------+-----+--------------------+
  |15.509775004326936|          0|  100|[15.5097750043269...|
  |15.509775004326936|          0|  105|[15.5097750043269...|
  |15.509775004326936|          0|  109|[15.5097750043269...|
  |15.509775004326936|          0|  115|[15.5097750043269...|
  +------------------+-----------+-----+--------------------+


  +------------------+-----------+-----+--------------------+
  |         eventName|original_dt|count|            features|
  +------------------+-----------+-----+--------------------+
  |15.509775004326936|          0|  101|[15.5097750043269...|
  |15.509775004326936|          0|  102|[15.5097750043269...|
  |15.509775004326936|          0|  103|[15.5097750043269...|
  |15.509775004326936|          0|  113|[15.5097750043269...|
  | 43.01955000865387|          0|  116|[43.0195500086538...|
  +------------------+-----------+-----+--------------------+

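Another solution is to use limit and except. The following program returns an array of DataFrames that each have the same number of rows, except the first one, which holds the remainder and may contain fewer rows. Note that except is a set difference that also removes duplicate rows, so this approach assumes the rows of the input are distinct.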

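import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

var numberOfNew = 4
var input = List(1,2,3,4,5,6,7,8,9).toDF
// numberOfNew + 1 slots: slots 1..numberOfNew get `limit` rows each, slot 0 gets the remainder
val newFrames = (0 to numberOfNew).map(_ => Seq.empty[Int].toDF).toArray
var size = input.count()
val limit = (size / numberOfNew).toInt

while (size > 0) {
    // once the full chunks are taken, put everything that is left into slot 0
    newFrames(numberOfNew) = if (numberOfNew == 0) input else input.limit(limit)
    // remove the rows just taken (except is a set difference and drops duplicates)
    input = input.except(newFrames(numberOfNew))
    size = if (numberOfNew == 0) 0L else size - limit
    numberOfNew = numberOfNew - 1
}

newFrames.foreach(_.show)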

+-----+
|value|
+-----+
|    7|
+-----+

+-----+
|value|
+-----+
|    4|
|    8|
+-----+

+-----+
|value|
+-----+
|    5|
|    9|
+-----+

...
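
To see why the first frame comes out smaller, the row counts can be worked out without Spark. This is only a rough sketch of the arithmetic behind the limit/except loop, assuming any leftover rows all end up in the first slot; `ChunkSizes` and `chunkSizes` are hypothetical names used for illustration, not part of Spark.

```scala
object ChunkSizes {
  // Hypothetical helper: row counts per slot for `size` rows and `k` requested frames.
  // Slots 1..k each get floor(size / k) rows; slot 0 gets whatever is left over.
  def chunkSizes(size: Long, k: Int): Vector[Long] = {
    require(k > 0, "k must be positive")
    val limit = size / k               // rows per full frame (integer division)
    val remainder = size - limit * k   // rows that do not fit into the k full frames
    remainder +: Vector.fill(k)(limit)
  }

  def main(args: Array[String]): Unit = {
    println(chunkSizes(9, 4))  // Vector(1, 2, 2, 2, 2)
    println(chunkSizes(16, 4)) // Vector(0, 4, 4, 4, 4)
  }
}
```

For the 17-row DataFrame in the question with k = 4, this gives one frame with 1 row followed by four frames with 4 rows each, so the result has five frames rather than the four shown in the question.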

Tags: dataframe, scala, apache-spark

Alessandro La Corte

1 Answers

Steffen Schmitz
