Using Scala, how can I split a DataFrame into multiple DataFrames (be it an array or a collection) that share the same column value? For example, I want to split the following DataFrame:
ID Rate State
1 24 AL
2 35 MN
3 46 FL
4 34 AL
5 78 MN
6 99 FL
to:
data set 1
ID Rate State
1 24 AL
4 34 AL
data set 2
ID Rate State
2 35 MN
5 78 MN
data set 3
ID Rate State
3 46 FL
6 99 FL
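For reference, here is a minimal sketch (assuming a SparkSession named spark, with spark.implicits._ imported) that builds this example DataFrame, so the snippets below can be tried directly:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-by-state").master("local[*]").getOrCreate()
import spark.implicits._

// Example data matching the question
val df = Seq(
  (1, 24, "AL"), (2, 35, "MN"), (3, 46, "FL"),
  (4, 34, "AL"), (5, 78, "MN"), (6, 99, "FL")
).toDF("ID", "Rate", "State")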
You can collect the unique state values and simply map over the resulting array:
// Collect the distinct values of State as a local array
val states = df.select("State").distinct.collect.flatMap(_.toSeq)
// Build one DataFrame per state using a null-safe equality filter
val byStateArray = states.map(state => df.where($"State" <=> state))
or to a map:
val byStateMap = states
.map(state => (state -> df.where($"State" <=> state)))
.toMap
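A quick usage sketch, using the AL/MN/FL values from the example, to pull groups out of that map:

// Look up the DataFrame for a single state and inspect it
byStateMap("AL").show()
// Or iterate over all groups
byStateMap.foreach { case (state, stateDf) =>
  println(s"State: $state, rows: ${stateDf.count()}")
}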
The same thing in Python:
from itertools import chain
from pyspark.sql.functions import col, lit  # lit is only needed for the NULL-safe alternative below

# Flatten the collected rows of distinct values into a plain iterable
states = chain(*df.select("state").distinct().collect())

# eqNullSafe requires PySpark 2.3 or later.
# In 2.2 and before, col("state") == state
# gives the same outcome if NULLs can be ignored.
# If NULLs are important, use:
# (lit(state).isNull() & col("state").isNull()) | (col("state") == state)
df_by_state = {
    state: df.where(col("state").eqNullSafe(state))
    for state in states
}
The obvious problem with this approach is that it requires a full data scan for each level, so it is an expensive operation. If you're only looking for a way to split the output, see also How do I split an RDD into two or more RDDs?
In particular, you can write the Dataset partitioned by the column of interest:
val path: String = ???
df.write.partitionBy("State").parquet(path)
and read back if needed:
// Depends on partition pruning
for { state <- states } yield spark.read.parquet(path).where($"State" === state)
// or explicitly read the partition
for { state <- states } yield spark.read.parquet(s"$path/State=$state")
Depending on the size of the data, the number of levels of the split, and the storage and persistence level of the input, this can be faster or slower than multiple filters.
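If you do stick with multiple filters over the same input, caching the source once is one way to avoid rescanning it for every level. A sketch using the standard Spark persistence API (the storage level chosen here is an assumption):

import org.apache.spark.storage.StorageLevel

// Cache the input once so each per-state filter reads from memory/disk
// instead of rescanning the source
df.persist(StorageLevel.MEMORY_AND_DISK)
val byStateArrayCached = states.map(state => df.where($"State" <=> state))
// ... use the per-state DataFrames ...
df.unpersist()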
It is very simple (with Spark 2.x) if you register the DataFrame as a temporary view.
df1.createOrReplaceTempView("df1")
Now you can run the queries:
val df2 = spark.sql("select * from df1 where state = 'FL'")
val df3 = spark.sql("select * from df1 where state = 'MN'")
val df4 = spark.sql("select * from df1 where state = 'AL'")
Now you have df2, df3, and df4. If you want them as local collections, you can use
df2.collect()
df3.collect()
or even the map/filter functions. Please refer to https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
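If you don't want to hard-code the state values, a hypothetical variation of the same temp-view approach (a sketch, reusing the df1 view registered above) builds one DataFrame per distinct state:

// Collect the distinct states, then run one query per state against the view
val stateValues = spark.sql("select distinct state from df1").collect().map(_.getString(0))
val byState = stateValues.map { s =>
  s -> spark.sql(s"select * from df1 where state = '$s'")
}.toMap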