I have a PySpark DataFrame and I want to randomly sample ~100k unique IDs from anywhere in the entire DataFrame. The DataFrame is transaction based, so an ID will appear multiple times; I want to get 100k distinct IDs and then pull all the transaction records for each of those IDs from the DataFrame.
I've tried:
sample = df.sample(False, 0.5, 42)
sample = sample.distinct()
Then I'm unsure how to match it back to the original DataFrame. Also, some of the IDs are not clean; I want to be able to put a condition on the sample that says the ID must be, for example, 10 digits.
sample = (
    df
    .where("length(ID) == 10")   # keep only 10-digit IDs
    .select("ID").distinct()     # make it unique on ID
    .sample(False, 0.5, 42)      # now take the sample
    .join(df, "ID")              # and finally join it back
)
Actually, it's not that hard, since you already pointed out all the necessary steps.
I prefer working with hashes if I want to make sure I get the same dataset again and again. It is kind of random as well. With this method you can select X% of your unique IDs, so if you want ~100k IDs you need to do some maths.
import pyspark.sql.functions as F
df = df.wihtColumn("hash", F.hash(F.col("ID")) % 1000) # number between -999 and 999
df = df.filter("hash = 0")
You should check the distribution as well; I think you need to take the absolute value of the hash, because it can be negative.
Alternatively:
df = df.wihtColumn("hash", F.abs(F.hash(F.col("ID")) % 1000)) # number between 0 and 999
With this logic, you will get about 0.1% of your IDs, more or less randomly selected.
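To land near 100k IDs with the hash approach, the "maths" is just deriving the modulus from the distinct-ID count. A rough sketch, assuming the column is named ID (hash buckets are not perfectly even, so the result is approximate, but it is deterministic for the same data):

import math
import pyspark.sql.functions as F

# How many distinct clean IDs are there?
n_ids = df.where("length(ID) == 10").select("ID").distinct().count()

# Choose the modulus so that one bucket holds roughly 100k IDs
modulus = max(1, math.ceil(n_ids / 100_000))

sampled_ids = (
    df
    .where("length(ID) == 10")
    .select("ID").distinct()
    .withColumn("hash", F.abs(F.hash(F.col("ID")) % modulus))
    .filter("hash = 0")    # keep a single bucket => ~100k IDs
    .drop("hash")
)

result = sampled_ids.join(df, "ID")  # all transactions for those IDs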