rdd.histogram gives "can not generate buckets with non-number in RDD" error

Question

I have the following one-column DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
df = spark.createDataFrame([[1],[2],[3],[4],[5]])
df.show()

+---+
| _1|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
+---+

Then I compute a histogram using the RDD's histogram function:

df.rdd.histogram(2)

Then I get an error: Can not generate buckets with non-number in RDD. I am confused because all values in my dataframe are numbers.

Oli · Accepted Answer

The problem is that df.rdd is an RDD of Row objects, and a Row is not a number, even when the only field it holds is one. You can verify this by calling collect, for instance in the pyspark shell:

>>> df.rdd.collect()
[Row(_1=1), Row(_1=2), Row(_1=3), Row(_1=4), Row(_1=5)]

To make this work, you can simply extract your numeric column from the row like this:

>>> df.rdd.map(lambda x : x[0]).histogram(2)
([1, 3, 5], [2, 3])
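
The result is a pair: the bucket boundaries and the count in each bucket. With 2 buckets over the values 1 to 5, the buckets are [1, 3) and [3, 5], where only the last bucket includes its upper bound. The pure-Python sketch below reproduces that bookkeeping for evenly spaced buckets. The function name histogram_sketch is hypothetical, and this is a simplified model of the behaviour, not Spark's implementation.

```python
def histogram_sketch(values, num_buckets):
    """Count values in evenly spaced buckets between min and max.

    Every bucket is half-open [low, high) except the last one,
    which also includes the maximum value.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical: a single bucket holds everything.
        return [lo, hi], [len(values)]

    width = (hi - lo) / num_buckets
    # Boundaries: num_buckets evenly spaced lower edges plus the maximum.
    edges = [lo + i * width for i in range(num_buckets)] + [hi]

    counts = [0] * num_buckets
    for v in values:
        # Clamp the index so the maximum value lands in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return edges, counts


edges, counts = histogram_sketch([1, 2, 3, 4, 5], 2)
# edges compares equal to [1, 3, 5]; counts is [2, 3]
```

The values 1 and 2 fall in the first bucket, and 3, 4 and 5 fall in the second, which matches the counts [2, 3] returned above.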

Tags: apache-spark, pyspark

Asked by Tony
