I have an unbalanced DataFrame in Spark using PySpark, and I want to resample it to make it balanced. I can only find the sample function in PySpark:
sample(withReplacement, fraction, seed=None)
but I want to sample the DataFrame weighted by unitvolume. In pandas I can do it like:
df.sample(n, replace=False, weights=np.log(df["unitvolume"]))
Is there any way to do the same using PySpark?
Spark provides tools for stratified sampling, but they work only on categorical data. You could try to bucketize it:
from pyspark.ml.feature import Bucketizer
from pyspark.sql.functions import col, log
df_log = df.withColumn("log_unitvolume", log(col("unitvolume")))
splits = ... # A list of splits
bucketizer = Bucketizer(splits=splits, inputCol="log_unitvolume", outputCol="bucketed_log_unitvolume")
df_log_bucketed = bucketizer.transform(df_log)
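The splits have to cover the whole range of log_unitvolume. One way to pick them, sketched here with assumed quantile probabilities, is from approximate quantiles:
# Assumed probabilities; tune them to your data's distribution.
quantiles = df_log.approxQuantile("log_unitvolume", [0.25, 0.5, 0.75], 0.01)
# Bucketizer requires strictly increasing splits, so drop duplicates.
splits = [-float("inf")] + sorted(set(quantiles)) + [float("inf")]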
Compute statistics:
counts = df_log_bucketed.groupBy("bucketed_log_unitvolume").count()
fractions = ...  # Define a sampling fraction for each bucket
and use these for sampling:
df_log_bucketed.sampleBy("bucketed_log_unitvolume", fractions)
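One hedged way to fill in the fractions, assuming the goal is to downsample every bucket to the size of the smallest one (the seed here is arbitrary):
rows = counts.collect()
min_count = min(r["count"] for r in rows)
# Sample each bucket down to the smallest bucket's size.
fractions = {r["bucketed_log_unitvolume"]: min_count / r["count"] for r in rows}
df_balanced = df_log_bucketed.sampleBy("bucketed_log_unitvolume", fractions, seed=42)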
You can also try to rescale log_unitvolume to the [0, 1] range and keep each row with probability equal to its rescaled weight:
from pyspark.sql.functions import rand
df_log_rescaled.where(rand() < col("log_unitvolume_rescaled"))
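df_log_rescaled is not defined above; a minimal min-max rescaling sketch to build it (the lo/hi aliases are my own names):
from pyspark.sql.functions import min as sql_min, max as sql_max

# Collect the global min and max of log_unitvolume in one pass.
bounds = df_log.agg(sql_min("log_unitvolume").alias("lo"),
                    sql_max("log_unitvolume").alias("hi")).first()
df_log_rescaled = df_log.withColumn(
    "log_unitvolume_rescaled",
    (col("log_unitvolume") - bounds["lo"]) / (bounds["hi"] - bounds["lo"]))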