I have a Dataframe and wish to divide it into an equal number of rows.
In other words, I want a list of dataframes where each one is a disjointed subset of the original dataframe.
Let's say the input dataframer is the following:
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 100|[15.5097750043269...|
|15.509775004326936| 0| 101|[15.5097750043269...|
|15.509775004326936| 0| 102|[15.5097750043269...|
|15.509775004326936| 0| 103|[15.5097750043269...|
|15.509775004326936| 0| 104|[15.5097750043269...|
|15.509775004326936| 0| 105|[15.5097750043269...|
|15.509775004326936| 0| 106|[15.5097750043269...|
|15.509775004326936| 0| 107|[15.5097750043269...|
|15.509775004326936| 0| 108|[15.5097750043269...|
|15.509775004326936| 0| 109|[15.5097750043269...|
|15.509775004326936| 0| 110|[15.5097750043269...|
|15.509775004326936| 0| 111|[15.5097750043269...|
|15.509775004326936| 0| 112|[15.5097750043269...|
|15.509775004326936| 0| 113|[15.5097750043269...|
|15.509775004326936| 0| 114|[15.5097750043269...|
|15.509775004326936| 0| 115|[15.5097750043269...|
| 43.01955000865387| 0| 116|[43.0195500086538...|
+------------------+-----------+-----+--------------------+
I wish to split it in K equal sized dataframes. If k = 4, then a possible results would be:
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 106|[15.5097750043269...|
|15.509775004326936| 0| 107|[15.5097750043269...|
|15.509775004326936| 0| 110|[15.5097750043269...|
|15.509775004326936| 0| 111|[15.5097750043269...|
+------------------+-----------+-----+--------------------+
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 104|[15.5097750043269...|
|15.509775004326936| 0| 108|[15.5097750043269...|
|15.509775004326936| 0| 112|[15.5097750043269...|
|15.509775004326936| 0| 114|[15.5097750043269...|
+------------------+-----------+-----+--------------------+
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 100|[15.5097750043269...|
|15.509775004326936| 0| 105|[15.5097750043269...|
|15.509775004326936| 0| 109|[15.5097750043269...|
|15.509775004326936| 0| 115|[15.5097750043269...|
+------------------+-----------+-----+--------------------+
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 101|[15.5097750043269...|
|15.509775004326936| 0| 102|[15.5097750043269...|
|15.509775004326936| 0| 103|[15.5097750043269...|
|15.509775004326936| 0| 113|[15.5097750043269...|
| 43.01955000865387| 0| 116|[43.0195500086538...|
+------------------+-----------+-----+--------------------+
For Column: == returns a boolean. === returns a column (which contains the result of the comparisons of the elements of two columns)
Example 1: Split dataframe using 'DataFrame.limit()' We will make use of the split() method to create 'n' equal dataframes. Where, Limits the result count to the number specified.
UNION (alternatively, UNION DISTINCT ) takes only distinct rows while UNION ALL does not remove duplicates from the result rows.
By default Spark with Scala, Java, or with Python (PySpark), fetches only 20 rows from DataFrame show() but not all rows and the column value is truncated to 20 characters, In order to fetch/display more than 20 rows and column full value from Spark/PySpark DataFrame, you need to pass arguments to the show() method.
Another solution is to use limit and except. The following program will return an array with Dataframes that have an equal number of rows. Except the first one that may contain less rows.
var numberOfNew = 4
var input = List(1,2,3,4,5,6,7,8,9).toDF
var newFrames = 0 to numberOfNew map (_ => Seq.empty[Int].toDF) toArray
var size = input.count();
val limit = (size / numberOfNew).toInt
while (size > 0) {
newFrames(numberOfNew) = input.limit(limit)
input = input.except(newFrames(numberOfNew))
size = size - limit
numberOfNew = numberOfNew - 1
}
newFrames.foreach(_.show)
+-----+
|value|
+-----+
| 7|
+-----+
+-----+
|value|
+-----+
| 4|
| 8|
+-----+
+-----+
|value|
+-----+
| 5|
| 9|
+-----+
...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With