I have a PySpark DataFrame and I want to randomly sample ~100k unique IDs from anywhere in the entire DataFrame. The DataFrame is transaction based, so an ID will appear multiple times; I want to get 100k distinct IDs and then pull all the transaction records for each of those IDs from the DataFrame.
I've tried:
sample = df.sample(False, 0.5, 42)
sample = sample.distinct()
Then I'm unsure how to match it back to the original DataFrame. Also, some of the IDs are not clean; I want to be able to put a condition on the sample that says the ID must be, for example, 10 digits.
sample = (
    df
    .where("length(ID) == 10")   # keep only 10-digit IDs
    .select("ID").distinct()     # make it unique on ID
    .sample(False, 0.5, 42)      # now take the sample
    .join(df, "ID")              # and finally join it back
)
Actually, it's not that hard, since you already pointed out all the necessary steps.
I prefer working with hashes if I want to make sure I get the same dataset again and again. It is kind of random as well. With this method you can select X% of your unique IDs, so if you want ~100k IDs you need to do some maths.
import pyspark.sql.functions as F
df = df.wihtColumn("hash", F.hash(F.col("ID")) % 1000) # number between -999 and 999
df = df.filter("hash = 0")
You should check the distribution as well; I think you need to take the absolute value of the hash, because it can be negative.
Alternatively:
df = df.wihtColumn("hash", F.abs(F.hash(F.col("ID")) % 1000)) # number between 0 and 999
With this logic, you will get about 0.1% of your IDs, more or less randomly selected.
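To land near 100k IDs with the hash approach, the "maths" is just deriving the modulus from the distinct-ID count. A rough sketch, assuming the column is named ID (hash buckets are not perfectly even, so the result is approximate, but it is deterministic for the same data):

import math
import pyspark.sql.functions as F

# How many distinct clean IDs are there?
n_ids = df.where("length(ID) == 10").select("ID").distinct().count()

# Choose the modulus so that one bucket holds roughly 100k IDs
modulus = max(1, math.ceil(n_ids / 100_000))

sampled_ids = (
    df
    .where("length(ID) == 10")
    .select("ID").distinct()
    .withColumn("hash", F.abs(F.hash(F.col("ID")) % modulus))
    .filter("hash = 0")    # keep a single bucket => ~100k IDs
    .drop("hash")
)

result = sampled_ids.join(df, "ID")  # all transactions for those IDs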