How to bin in PySpark?

For example, I'd like to classify a DataFrame of people into the following 4 bins according to age.

age_bins = [0, 6, 18, 60, np.inf]
age_labels = ['infant', 'minor', 'adult', 'senior']

I would use pandas.cut() to do this in pandas. How do I do this in PySpark?
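For reference, the pandas version would look something like this (the DataFrame people and its age column are invented for illustration):

import numpy as np
import pandas as pd

age_bins = [0, 6, 18, 60, np.inf]
age_labels = ['infant', 'minor', 'adult', 'senior']

people = pd.DataFrame({'age': [2, 10, 23, 45, 70]})  # toy data for illustration
people['age_group'] = pd.cut(people['age'], bins=age_bins, labels=age_labels)
print(people)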

asked Sep 14 '17 by ceiling cat



People also ask

How do I set default value in Pyspark?

One approach is pyspark.sql.DataFrame.select(*cols): select() can create a new column in the DataFrame and set it to a default value, as sketched below.
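A minimal sketch of that approach, assuming a SparkSession named spark already exists (the column names are made up for illustration):

from pyspark.sql import functions as F

df = spark.createDataFrame([("a",), ("b",)], ["name"])

# keep all existing columns and append a new one filled with a constant default
df_with_default = df.select("*", F.lit(0).alias("score"))
df_with_default.show()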

How do you create a decile in Pyspark?

First compute the percent_rank, then multiply it by 10 and take the ceiling. All values with a percent_rank between 0 and 0.1 fall into decile 1, values between 0.1 and 0.2 into decile 2, and so on; see the sketch below.
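A sketch of that recipe on a single numeric column named value (ntile(10) would be the simpler built-in alternative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame([(v,) for v in range(1, 21)], ["value"])

# no partitionBy here, so Spark will pull everything into one partition;
# fine for a small example, costly on real data
w = Window.orderBy("value")

# percent_rank() returns values in [0, 1]; scale by 10 and take the ceiling.
# percent_rank 0.0 would give decile 0, so clamp the result to at least 1.
df_deciles = df.withColumn(
    "decile",
    F.greatest(F.ceil(F.percent_rank().over(w) * 10), F.lit(1)),
)
df_deciles.show()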

What does .collect do in Pyspark?

collect() is an action on an RDD or DataFrame that retrieves its data. It gathers all the rows from every partition and returns them to the driver program.
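For example (note that collect() pulls every row onto the driver, so it is only safe for small results):

values = [("a", 1), ("b", 2), ("c", 3)]
df = spark.createDataFrame(values, ["name", "n"])

# collect() returns a Python list of Row objects on the driver
rows = df.collect()
for row in rows:
    print(row["name"], row["n"])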

What is withColumn Pyspark?

PySpark withColumn() is a DataFrame transformation used to change the values of an existing column, convert its data type, create a new column, and more.
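A small illustration of those uses (the column names are invented for the example):

from pyspark.sql import functions as F

df = spark.createDataFrame([("a", "23"), ("b", "45")], ["name", "age"])

df2 = (df
       .withColumn("age", F.col("age").cast("int"))    # convert the type of an existing column
       .withColumn("is_adult", F.col("age") >= 18))    # create a new column
df2.show()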


1 Answer

You can use the Bucketizer feature transformer from Spark's ML library (pyspark.ml.feature).

values = [("a", 23), ("b", 45), ("c", 10), ("d", 60), ("e", 56),
          ("f", 2), ("g", 25), ("h", 40), ("j", 33)]
df = spark.createDataFrame(values, ["name", "ages"])

from pyspark.ml.feature import Bucketizer

bucketizer = Bucketizer(splits=[0, 6, 18, 60, float('Inf')],
                        inputCol="ages", outputCol="buckets")
df_buck = bucketizer.setHandleInvalid("keep").transform(df)

df_buck.show()

Output:

+----+----+-------+
|name|ages|buckets|
+----+----+-------+
|   a|  23|    2.0|
|   b|  45|    2.0|
|   c|  10|    1.0|
|   d|  60|    3.0|
|   e|  56|    2.0|
|   f|   2|    0.0|
|   g|  25|    2.0|
|   h|  40|    2.0|
|   j|  33|    2.0|
+----+----+-------+

If you want a name for each bucket, you can use a UDF to create a new column with the bucket labels:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

t = {0.0: "infant", 1.0: "minor", 2.0: "adult", 3.0: "senior"}
udf_foo = udf(lambda x: t[x], StringType())
df_buck.withColumn("age_bucket", udf_foo("buckets")).show()

Output:

+----+----+-------+----------+
|name|ages|buckets|age_bucket|
+----+----+-------+----------+
|   a|  23|    2.0|     adult|
|   b|  45|    2.0|     adult|
|   c|  10|    1.0|     minor|
|   d|  60|    3.0|    senior|
|   e|  56|    2.0|     adult|
|   f|   2|    0.0|    infant|
|   g|  25|    2.0|     adult|
|   h|  40|    2.0|     adult|
|   j|  33|    2.0|     adult|
+----+----+-------+----------+
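As a side note, if you would rather avoid a Python UDF (each value round-trips through the Python interpreter, which can be slow), a common alternative is to build the lookup table as a map literal with create_map. This is a sketch of that idea, not part of the original answer:

from itertools import chain
from pyspark.sql import functions as F

t = {0.0: "infant", 1.0: "minor", 2.0: "adult", 3.0: "senior"}

# build a MapType column literal: map(0.0, 'infant', 1.0, 'minor', ...)
mapping = F.create_map(*chain.from_iterable(
    (F.lit(k), F.lit(v)) for k, v in t.items()
))

# look up each bucket id in the map column, staying entirely in the JVM
df_buck.withColumn("age_bucket", mapping[F.col("buckets")]).show()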
answered Oct 05 '22 by pauli
