For example, I'd like to classify a DataFrame of people into the following 4 bins according to age.

age_bins = [0, 6, 18, 60, np.Inf]
age_labels = ['infant', 'minor', 'adult', 'senior']

In pandas I would use pandas.cut() for this. How do I do this in PySpark?
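For reference, this is roughly what the pandas version would look like; the DataFrame name df and the column name age are just placeholders here:

import numpy as np
import pandas as pd

age_bins = [0, 6, 18, 60, np.inf]
age_labels = ['infant', 'minor', 'adult', 'senior']

# pandas.cut assigns each age to one of the four labelled bins
df['age_bucket'] = pd.cut(df['age'], bins=age_bins, labels=age_labels)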
pyspark.sql.DataFrame.select(*cols) can also be used to create a new column in a DataFrame and set it to a default value.
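A minimal sketch of that idea, assuming a DataFrame df; the new column name and the default value are placeholders:

from pyspark.sql import functions as F

# select all existing columns plus a new column filled with a constant default value
df_with_default = df.select("*", F.lit("unknown").alias("age_bucket"))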
Another approach is to first compute the percent_rank, multiply it by 10, and take the ceiling. All values with a percent_rank between 0 and 0.1 then fall into decile 1, values with a percent_rank between 0.1 and 0.2 into decile 2, and so on.
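A sketch of that decile approach, assuming the df with an ages column used in the Bucketizer example below:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("ages")

# percent_rank() lies in [0, 1]; multiplying by 10 and taking the ceiling gives the decile
# note: the smallest value has percent_rank 0.0 and lands in "decile 0" unless clamped
df_deciles = df.withColumn("decile", F.ceil(F.percent_rank().over(w) * 10))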
collect() is an action on an RDD or DataFrame that retrieves its data. It is useful for gathering all rows from every partition and bringing them back to the driver node/program.
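For example, assuming the same df as below:

# collect() returns every row as a list of Row objects on the driver
rows = df.collect()
for row in rows:
    print(row["name"], row["ages"])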
PySpark withColumn() is a DataFrame transformation used to change the value of an existing column, convert its datatype, create a new column, and more.
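A small sketch, again assuming the df with name and ages columns used below:

from pyspark.sql import functions as F

df_transformed = (df
    .withColumn("ages", F.col("ages").cast("double"))   # convert the datatype of an existing column
    .withColumn("is_adult", F.col("ages") >= 18))       # create a new boolean column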
You can use the Bucketizer feature transformer from Spark's ml library.
values = [("a", 23), ("b", 45), ("c", 10), ("d", 60), ("e", 56),
          ("f", 2), ("g", 25), ("h", 40), ("j", 33)]
df = spark.createDataFrame(values, ["name", "ages"])

from pyspark.ml.feature import Bucketizer

# splits define the bin edges: [0, 6), [6, 18), [18, 60), [60, inf)
bucketizer = Bucketizer(splits=[0, 6, 18, 60, float('Inf')],
                        inputCol="ages", outputCol="buckets")
df_buck = bucketizer.setHandleInvalid("keep").transform(df)
df_buck.show()
Output:
+----+----+-------+
|name|ages|buckets|
+----+----+-------+
|   a|  23|    2.0|
|   b|  45|    2.0|
|   c|  10|    1.0|
|   d|  60|    3.0|
|   e|  56|    2.0|
|   f|   2|    0.0|
|   g|  25|    2.0|
|   h|  40|    2.0|
|   j|  33|    2.0|
+----+----+-------+
If you want a name for each bucket, you can use a udf to create a new column with the bucket names:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# map each bucket index to its label
t = {0.0: "infant", 1.0: "minor", 2.0: "adult", 3.0: "senior"}
udf_foo = udf(lambda x: t[x], StringType())
df_buck.withColumn("age_bucket", udf_foo("buckets")).show()
Output:
+----+----+-------+----------+
|name|ages|buckets|age_bucket|
+----+----+-------+----------+
|   a|  23|    2.0|     adult|
|   b|  45|    2.0|     adult|
|   c|  10|    1.0|     minor|
|   d|  60|    3.0|    senior|
|   e|  56|    2.0|     adult|
|   f|   2|    0.0|    infant|
|   g|  25|    2.0|     adult|
|   h|  40|    2.0|     adult|
|   j|  33|    2.0|     adult|
+----+----+-------+----------+