I am trying to bucketize the columns whose names contain the word "road" in a 5k dataset, and create a new dataframe from the result.
I am not sure how to do that. Here is what I have tried so far:
from pyspark.ml.feature import Bucketizer
spike_cols = [col for col in df.columns if "road" in col]
for x in spike_cols:
    bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                            inputCol=x, outputCol=x + "bucket")
    bucketedData = bucketizer.transform(df)
Either modify df in the loop:
from pyspark.ml.feature import Bucketizer
for x in spike_cols:
    bucketizer = Bucketizer(splits=[-float("inf"), 10, 100, float("inf")],
                            inputCol=x, outputCol=x + "bucket")
    # Reassign df so each new bucket column is kept for the next iteration
    df = bucketizer.transform(df)
or use Pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Bucketizer
# One Bucketizer stage per matching column
model = Pipeline(stages=[
    Bucketizer(
        splits=[-float("inf"), 10, 100, float("inf")],
        inputCol=x, outputCol=x + "bucket")
    for x in spike_cols
]).fit(df)
bucketedData = model.transform(df)
Since Spark 3.0.0, Bucketizer can map multiple columns at once by setting the inputCols parameter (together with outputCols and splitsArray).
So this has become easier:
from pyspark.ml.feature import Bucketizer
splits = [-float("inf"), 10, 100, float("inf")]
# One (input column, output column, splits) triple per matching column
params = [(col, col + "bucket", splits) for col in df.columns if "road" in col]
input_cols, output_cols, splits_array = zip(*params)
bucketizer = Bucketizer(inputCols=input_cols, outputCols=output_cols,
                        splitsArray=splits_array)
bucketedData = bucketizer.transform(df)
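To make it clearer what ends up in the new "bucket" columns, here is a rough pure-Python sketch of how a splits list like the one above maps a value to a bucket index. The helper bucket_index is hypothetical and not part of PySpark. It only mimics the documented rule that each bucket covers [lower, upper), except the last one, which also includes its upper bound. NaN handling is left out.

```python
from bisect import bisect_right


def bucket_index(value, splits):
    """Return the index of the bucket that `value` falls into.

    Hypothetical helper for illustration only, not a PySpark API.
    Buckets are [splits[i], splits[i + 1]), except the last bucket,
    which also includes its upper bound.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the range covered by splits")
    # bisect_right counts the split points <= value; subtracting 1 gives the
    # bucket index. The min() clamps a value equal to the final split point
    # into the last bucket.
    return min(bisect_right(splits, value) - 1, len(splits) - 2)


splits = [-float("inf"), 10, 100, float("inf")]
examples = [5, 10, 99.9, 100, 2500]
indices = [bucket_index(v, splits) for v in examples]
print(indices)  # [0, 1, 1, 2, 2]
```

With these splits, values below 10 land in bucket 0, values from 10 up to (but not including) 100 in bucket 1, and everything from 100 upward in bucket 2. Bucketizer itself stores these indices as doubles (0.0, 1.0, 2.0) in the output column.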