I use the following one-column DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
df = spark.createDataFrame([[1],[2],[3],[4],[5]])
df.show()
+---+
| _1|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
+---+
Then I compute a histogram using the RDD's histogram function:
df.rdd.histogram(2)
This raises an error: "Can not generate buckets with non-number in RDD". I am confused because all the values in my DataFrame are numbers.
The problem is that df.rdd is an RDD of Row objects, and rows are not numbers. You can verify this by calling collect(), for instance in the pyspark shell:
>>> df.rdd.collect()
[Row(_1=1), Row(_1=2), Row(_1=3), Row(_1=4), Row(_1=5)]
To make this work, you can simply extract your numeric column from the row like this:
>>> df.rdd.map(lambda x : x[0]).histogram(2)
([1, 3, 5], [2, 3])
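To read that result: the first list holds the bucket boundaries and the second holds the count per bucket. With evenly spaced buckets, each bucket includes its left edge and excludes its right edge, except the last one, which also includes the maximum. So here [1, 3) contains 1 and 2, and [3, 5] contains 3, 4 and 5. The following is only a plain-Python sketch of that bucketing rule, not Spark's actual implementation; the helper name `even_histogram` is made up for illustration, and it assumes the values are not all equal.

```python
def even_histogram(values, num_buckets):
    """Plain-Python sketch of evenly spaced bucketing (hypothetical helper, not Spark code)."""
    lo, hi = min(values), max(values)
    # Assumes hi > lo; otherwise the width would be zero.
    width = (hi - lo) / num_buckets
    # num_buckets + 1 boundaries running from the minimum to the maximum.
    edges = [lo + i * width for i in range(num_buckets)] + [hi]
    counts = [0] * num_buckets
    for v in values:
        # Each bucket is [left, right), except the last, which also includes the maximum.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return edges, counts


edges, counts = even_histogram([1, 2, 3, 4, 5], 2)
print(counts)  # [2, 3]
```

The counts match the histogram(2) output above: two values fall in the first bucket and three in the second.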