I have many columns that I am joining on which can sometimes contain billions of rows of null values, so I would like to salt those columns to prevent skew after the join, as described in Jason Evans' post: https://stackoverflow.com/a/43394695
I cannot find an equivalent example of this in Python, and the syntax is just different enough that I can't figure out how to translate it.
Here is approximately what I have:
import pyspark.sql.functions as psf

big_neg = -200

for column in key_fields:  # key_fields is a list of join keys in the dataframe
    df = df.withColumn(
        column,
        psf.when(
            psf.col(column).isNull(),
            psf.round(psf.rand().multiply(big_neg))
        ).otherwise(df[column])
    )
This currently fails with a TypeError:

TypeError: 'Column' object is not callable

I have already tried many syntax variations to get rid of the TypeError, but I am stumped.
I was actually able to figure it out after taking a break. In case it helps anyone else who runs into this problem, here is my solution:
df = df.withColumn(
    column,
    psf.when(df[column].isNull(), psf.round(psf.rand() * big_neg))
       .otherwise(df[column])
)
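For anyone wondering why the original version failed: PySpark's Column has no multiply method, so multiplication has to use the * operator. Attribute access on a Column returns another Column (a nested-field reference), which is why calling .multiply(...) raised TypeError: 'Column' object is not callable rather than an AttributeError. Below is a self-contained sketch of the full salting loop; the toy DataFrame and the key_fields list are placeholders made up for illustration, not part of the original question.

import pyspark.sql.functions as psf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame standing in for the real one; "a" is the join key.
df = spark.createDataFrame(
    [(1, 10), (None, 20), (None, 30), (None, 40)],
    ["a", "b"],
)

big_neg = -200
key_fields = ["a"]  # placeholder list of join keys

for column in key_fields:
    # Replace each null key with a random whole number in [big_neg, 0]
    # so the null-key rows spread across many hash buckets during the
    # join instead of piling onto a single partition.
    df = df.withColumn(
        column,
        psf.when(df[column].isNull(), psf.round(psf.rand() * big_neg))
           .otherwise(df[column]),
    )

df.show()

One thing to watch: psf.round(psf.rand() * big_neg) yields whole numbers in [big_neg, 0], so make sure that range cannot collide with legitimate key values, and pick a big_neg large enough in magnitude to spread the nulls over enough distinct salt values.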