How to bin in PySpark?

For example, I'd like to classify a DataFrame of people into the following 4 bins according to age.

age_bins = [0, 6, 18, 60, np.inf]
age_labels = ['infant', 'minor', 'adult', 'senior']

I would use pandas.cut() to do this in pandas. How do I do this in PySpark?
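For reference, the pandas version would look something like this (the DataFrame people and its age column are invented for illustration):

import numpy as np
import pandas as pd

age_bins = [0, 6, 18, 60, np.inf]
age_labels = ['infant', 'minor', 'adult', 'senior']

people = pd.DataFrame({'age': [2, 10, 23, 45, 70]})  # toy data for illustration
people['age_group'] = pd.cut(people['age'], bins=age_bins, labels=age_labels)
print(people)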

asked Sep 14 '17 by ceiling cat



People also ask

How do I set default value in Pyspark?

One approach is pyspark.sql.DataFrame.select(*cols): select() can create a new column in the DataFrame and set it to a default value, as sketched below.
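A minimal sketch of that approach, assuming a SparkSession named spark already exists (the column names are made up for illustration):

from pyspark.sql import functions as F

df = spark.createDataFrame([("a",), ("b",)], ["name"])

# keep all existing columns and append a new one filled with a constant default
df_with_default = df.select("*", F.lit(0).alias("score"))
df_with_default.show()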

How do you create a decile in Pyspark?

First compute the percent_rank, then multiply it by 10 and take the ceiling. All values with a percent_rank between 0 and 0.1 fall into decile 1, values between 0.1 and 0.2 into decile 2, and so on; see the sketch below.
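A sketch of that recipe on a single numeric column named value (ntile(10) would be the simpler built-in alternative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame([(v,) for v in range(1, 21)], ["value"])

# no partitionBy here, so Spark will pull everything into one partition;
# fine for a small example, costly on real data
w = Window.orderBy("value")

# percent_rank() returns values in [0, 1]; scale by 10 and take the ceiling.
# percent_rank 0.0 would give decile 0, so clamp the result to at least 1.
df_deciles = df.withColumn(
    "decile",
    F.greatest(F.ceil(F.percent_rank().over(w) * 10), F.lit(1)),
)
df_deciles.show()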

What does .collect do in Pyspark?

collect() is an action on an RDD or DataFrame that retrieves its data. It gathers all the rows from every partition and returns them to the driver program.
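For example (note that collect() pulls every row onto the driver, so it is only safe for small results):

values = [("a", 1), ("b", 2), ("c", 3)]
df = spark.createDataFrame(values, ["name", "n"])

# collect() returns a Python list of Row objects on the driver
rows = df.collect()
for row in rows:
    print(row["name"], row["n"])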

What is withColumn Pyspark?

PySpark withColumn() is a DataFrame transformation used to change the values of an existing column, convert its data type, create a new column, and more.
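A small illustration of those uses (the column names are invented for the example):

from pyspark.sql import functions as F

df = spark.createDataFrame([("a", "23"), ("b", "45")], ["name", "age"])

df2 = (df
       .withColumn("age", F.col("age").cast("int"))    # convert the type of an existing column
       .withColumn("is_adult", F.col("age") >= 18))    # create a new column
df2.show()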


1 Answer

You can use the Bucketizer feature transformer from Spark's ML library (pyspark.ml.feature).

values = [("a", 23), ("b", 45), ("c", 10), ("d", 60), ("e", 56),
          ("f", 2), ("g", 25), ("h", 40), ("j", 33)]
df = spark.createDataFrame(values, ["name", "ages"])

from pyspark.ml.feature import Bucketizer

bucketizer = Bucketizer(splits=[0, 6, 18, 60, float('Inf')],
                        inputCol="ages", outputCol="buckets")
df_buck = bucketizer.setHandleInvalid("keep").transform(df)

df_buck.show()

Output:

+----+----+-------+
|name|ages|buckets|
+----+----+-------+
|   a|  23|    2.0|
|   b|  45|    2.0|
|   c|  10|    1.0|
|   d|  60|    3.0|
|   e|  56|    2.0|
|   f|   2|    0.0|
|   g|  25|    2.0|
|   h|  40|    2.0|
|   j|  33|    2.0|
+----+----+-------+

If you want a name for each bucket, you can use a UDF to create a new column with the bucket labels:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

t = {0.0: "infant", 1.0: "minor", 2.0: "adult", 3.0: "senior"}
udf_foo = udf(lambda x: t[x], StringType())
df_buck.withColumn("age_bucket", udf_foo("buckets")).show()

Output:

+----+----+-------+----------+
|name|ages|buckets|age_bucket|
+----+----+-------+----------+
|   a|  23|    2.0|     adult|
|   b|  45|    2.0|     adult|
|   c|  10|    1.0|     minor|
|   d|  60|    3.0|    senior|
|   e|  56|    2.0|     adult|
|   f|   2|    0.0|    infant|
|   g|  25|    2.0|     adult|
|   h|  40|    2.0|     adult|
|   j|  33|    2.0|     adult|
+----+----+-------+----------+
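As a side note, if you would rather avoid a Python UDF (each value round-trips through the Python interpreter, which can be slow), a common alternative is to build the lookup table as a map literal with create_map. This is a sketch of that idea, not part of the original answer:

from itertools import chain
from pyspark.sql import functions as F

t = {0.0: "infant", 1.0: "minor", 2.0: "adult", 3.0: "senior"}

# build a MapType column literal: map(0.0, 'infant', 1.0, 'minor', ...)
mapping = F.create_map(*chain.from_iterable(
    (F.lit(k), F.lit(v)) for k, v in t.items()
))

# look up each bucket id in the map column, staying entirely in the JVM
df_buck.withColumn("age_bucket", mapping[F.col("buckets")]).show()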
answered Oct 05 '22 by pauli
