I have a Spark <code>DataFrame</code> with one column that contains mostly zeros and very few ones (only 0.01% of the values are ones). I'd like to take a random subsample that is stratified, so that it keeps the ratio of 1s to 0s in that column. Is this possible in PySpark? I am looking for a non-Scala solution, and one based on <code>DataFrame</code>s rather than <code>RDD</code>s.


Stratified sampling with pyspark

2 Answers

The solution I suggested in <a href="https://stackoverflow.com/a/32241887/3415409">Stratified sampling in Spark</a> is pretty straightforward to convert from Scala to Python (or even to Java; see "What's the easiest way to stratify a Spark Dataset?").

Nevertheless, I'll rewrite it in Python. Let's start by creating a toy DataFrame:

from pyspark.sql.functions import lit

# `spark` is the active SparkSession (predefined in the pyspark shell).
# The data is named `rows` so it does not shadow the built-in `list`.
rows = [(2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),
        (2147481832,541023347,1),(2147481832,1682206630,1),(2147481832,1138211459,1),
        (2147481832,852202566,1),(2147481832,201375938,1),(2147481832,486538879,1),
        (2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1)]
df = spark.createDataFrame(rows, ["x1","x2","x3"])
df.show()
# +----------+----------+---+
# |        x1|        x2| x3|
# +----------+----------+---+
# |2147481832|  23355149|  1|
# |2147481832| 973010692|  1|
# |2147481832|2134870842|  1|
# |2147481832| 541023347|  1|
# |2147481832|1682206630|  1|
# |2147481832|1138211459|  1|
# |2147481832| 852202566|  1|
# |2147481832| 201375938|  1|
# |2147481832| 486538879|  1|
# |2147481832| 919187908|  1|
# | 214748183| 919187908|  1|
# | 214748183|  91187908|  1|
# +----------+----------+---+

This DataFrame has 12 rows, as you can see:

df.count()
# 12

They are distributed as follows:

df.groupBy("x1").count().show()
# +----------+-----+
# |        x1|count|
# +----------+-----+
# |2147481832|   10|
# | 214748183|    2|
# +----------+-----+

Now let's sample.

First we'll set the seed:

seed = 12

Then find the keys to stratify on, assign each one a fraction, and sample:

fractions = df.select("x1").distinct().withColumn("fraction", lit(0.8)).rdd.collectAsMap()
print(fractions)
# {2147481832: 0.8, 214748183: 0.8}
sampled_df = df.stat.sampleBy("x1", fractions, seed)
sampled_df.show()
# +----------+---------+---+
# |        x1|       x2| x3|
# +----------+---------+---+
# |2147481832| 23355149|  1|
# |2147481832|973010692|  1|
# |2147481832|541023347|  1|
# |2147481832|852202566|  1|
# |2147481832|201375938|  1|
# |2147481832|486538879|  1|
# |2147481832|919187908|  1|
# | 214748183|919187908|  1|
# | 214748183| 91187908|  1|
# +----------+---------+---+
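
Since the question asks for a DataFrame-only approach, the fractions map can also be built without going through `.rdd`. The sketch below is one possible way to do that. `uniform_fractions` and `stratified_sample` are hypothetical helper names, not part of PySpark, and null keys are skipped on the assumption that null strata should not be sampled.

```python
def uniform_fractions(keys, fraction):
    """Map every non-null stratum key to the same sampling fraction."""
    return {key: float(fraction) for key in keys if key is not None}


def stratified_sample(df, column, fraction, seed=None):
    """Sample roughly `fraction` of the rows in each stratum of `column`."""
    # Collect the distinct stratum keys to the driver through the DataFrame API.
    keys = [row[0] for row in df.select(column).distinct().collect()]
    # DataFrame.sampleBy takes the column name, a {key: fraction} dict and a seed.
    return df.sampleBy(column, uniform_fractions(keys, fraction), seed)
```

With the toy DataFrame above, `stratified_sample(df, "x1", 0.8, seed=12)` should be equivalent to the `df.stat.sampleBy` call. Using the same fraction for every key is what keeps the ratio between the groups, on average.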

We can now check the content of our sample:

sampled_df.count()
# 9

sampled_df.groupBy("x1").count().show()
# +----------+-----+
# |        x1|count|
# +----------+-----+
# |2147481832|    7|
# | 214748183|    2|
# +----------+-----+
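
Note that the sample holds 7 of 10 and 2 of 2 rows rather than exactly 80% of each group. `sampleBy` keeps or drops each row independently with the probability given for its stratum, so the per-stratum counts match the fractions only on average. The plain-Python sketch below is a conceptual model of that behaviour, not Spark's implementation; `bernoulli_stratified_sample` and `expected_counts` are hypothetical names.

```python
import random


def bernoulli_stratified_sample(rows, key_index, fractions, seed):
    """Keep each row independently with the probability of its stratum."""
    rng = random.Random(seed)
    # Strata missing from `fractions` get probability 0, as with sampleBy.
    return [row for row in rows if rng.random() < fractions.get(row[key_index], 0.0)]


def expected_counts(stratum_sizes, fractions):
    """Average number of rows kept per stratum over many random draws."""
    return {key: size * fractions.get(key, 0.0) for key, size in stratum_sizes.items()}


# Stratum sizes from the toy DataFrame: 10 rows and 2 rows, both sampled at 0.8.
sizes = {2147481832: 10, 214748183: 2}
print(expected_counts(sizes, {2147481832: 0.8, 214748183: 0.8}))
```

The expected counts are 8 and 1.6 rows, so a draw of 7 and 2 is well within normal variation. For a very rare class, such as the 0.01% of ones in the question, a small fraction can leave only a handful of positive rows in a given draw, or none at all, so it is worth checking the sampled counts.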

answered Sep 17 '22 05:09

eliasah

Assume you have the Titanic dataset in a DataFrame named 'data', and you want to split it into train and test sets using stratified sampling on the 'Survived' target variable.

# Check the initial distribution of 0s and 1s
data.groupBy("Survived").count().show()
# +--------+-----+
# |Survived|count|
# +--------+-----+
# |       1|  342|
# |       0|  549|
# +--------+-----+

# Take 70% of both 0s and 1s into the training set
train = data.sampleBy("Survived", fractions={0: 0.7, 1: 0.7}, seed=10)

# Subtract 'train' from the original 'data' to get the test set
test = data.subtract(train)

# Check the distribution of 0s and 1s in the train and test sets after sampling
train.groupBy("Survived").count().show()
# +--------+-----+
# |Survived|count|
# +--------+-----+
# |       1|  239|
# |       0|  399|
# +--------+-----+

test.groupBy("Survived").count().show()
# +--------+-----+
# |Survived|count|
# +--------+-----+
# |       1|  103|
# |       0|  150|
# +--------+-----+
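
Two details of this split are worth checking. `subtract` is a set difference (the DataFrame counterpart of SQL `EXCEPT DISTINCT`), so duplicate rows in `data` are collapsed in the test set; an anti join on a unique key avoids that. The sketch below assumes such a key column exists (the `id_col` argument; the answer does not name one), and `stratified_split` and `positive_share` are hypothetical helper names.

```python
def stratified_split(df, label_col, id_col, train_fraction, seed=None):
    """Split `df` into train/test, sampling each label at `train_fraction`."""
    labels = [row[0] for row in df.select(label_col).distinct().collect()]
    fractions = {label: float(train_fraction) for label in labels if label is not None}
    train = df.sampleBy(label_col, fractions, seed)
    # Keep every row whose id is not in train; unlike subtract(), duplicates survive.
    test = df.join(train.select(id_col), on=id_col, how="left_anti")
    return train, test


def positive_share(counts, positive=1):
    """Share of the positive class given a {label: count} mapping."""
    return counts.get(positive, 0) / sum(counts.values())


# Counts reported above for the full data, the train set, and the test set.
for name, counts in [("data", {1: 342, 0: 549}),
                     ("train", {1: 239, 0: 399}),
                     ("test", {1: 103, 0: 150})]:
    print(name, round(positive_share(counts), 3))
```

On the counts above, the share of survivors is about 0.384 in the full data, 0.375 in train and 0.407 in test, so the split is close to the original ratio but not exact.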

answered Sep 19 '22 05:09

Ankit Sharma


Tags:

apache-spark

apache-spark-sql

pyspark

user3245256
