Random sample in Pyspark without duplicates

Tags: python, pyspark

I have a PySpark DataFrame and I want to randomly sample ~100k unique IDs from anywhere in the entire DataFrame. The DataFrame is transaction-based, so an ID will appear multiple times; I want to get 100k distinct IDs and then pull every transaction record for each of those IDs from the DataFrame.

I've tried:

sample = df.sample(False, 0.5, 42)  # samples ~50% of the rows, not of the IDs
sample = sample.distinct()          # drops duplicate rows, not duplicate IDs

Then I'm unsure how to match it back to the original DataFrame. Also, some of the IDs are not clean, so I want to be able to put a condition on the sample that says the ID must be, for example, 10 digits.

asked Sep 02 '25 by liamod


2 Answers

sample = (
    df
    .where("length(ID) == 10")  # keep only IDs that are 10 characters long
    .select("ID").distinct()    # you want it unique on ID
    .sample(False, 0.5, 42)     # now you take the sample
    .join(df, "ID")             # and finally join it back to get all the transactions
)

Actually not that hard since you already pointed out all the necessary steps.
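If you need roughly 100k IDs rather than a fixed 50%, you can derive the fraction from the distinct count first. A minimal sketch, assuming the asker's 100000 target; the rlike pattern is also an assumption, enforcing actual digits rather than just a length of 10:

ids = df.where("ID rlike '^[0-9]{10}$'").select("ID").distinct()

# sample() takes a fraction, not a count, so the result is only approximately 100k
fraction = min(1.0, 100000 / ids.count())
sample = ids.sample(False, fraction, 42).join(df, "ID")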

answered Sep 04 '25 by jho


I prefer working with hashes when I want to make sure I get the same dataset again and again, and it is reasonably random as well. With this method you can select X% of your unique IDs, so if you want ~100k IDs you need to do some maths.

import pyspark.sql.functions as F

df = df.withColumn("hash", F.hash(F.col("ID")) % 1000)  # number between -999 and 999
df = df.filter("hash = 0")                              # keep only IDs that hash to 0

You should check the distribution as well; I think you need to take the absolute value of the hash, because it can be negative.

Alternatively:

df = df.withColumn("hash", F.abs(F.hash(F.col("ID")) % 1000))  # number between 0 and 999

With this logic, you will get roughly 0.1% of your IDs, more or less randomly selected.
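To turn "do some maths" into code: a hedged sketch that derives the bucket count from the number of distinct IDs so that one bucket holds roughly 100k of them (the 100000 target is the asker's figure, not part of this answer):

import pyspark.sql.functions as F

n_ids = df.select("ID").distinct().count()  # total number of distinct IDs
buckets = max(1, n_ids // 100000)           # one bucket then holds ~100k IDs
sample = (
    df.withColumn("bucket", F.abs(F.hash("ID")) % buckets)
      .filter("bucket = 0")                 # keep every transaction for IDs in bucket 0
      .drop("bucket")
)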

answered Sep 03 '25 by Christopher