I'm on Spark 1.3.
I would like to apply a function to each row of a dataframe. This function hashes each column of the row and returns a list of the hashes.
dataframe.map(row => row.toSeq.map(col => col.hashCode))
I get a NullPointerException when I run this code. I assume that this is related to SPARK-5063.
I can't think of a way to achieve the same result without using a nested map.
This isn't an instance of SPARK-5063 because you're not nesting RDD transformations; the inner .map() is being applied to a Scala Seq, not an RDD.
My hunch is that some rows in your data set contain null column values, so some of the col.hashCode calls are throwing NullPointerExceptions when you try to evaluate null.hashCode. To work around this, you need to take nulls into account when computing hash codes.
If you're running on a Java 7 JVM or higher, you can use Objects.hashCode, which returns 0 for a null argument:
import java.util.Objects
dataframe.map(row => row.toSeq.map(col => Objects.hashCode(col)))
Alternatively, on earlier versions of Java you can do
dataframe.map(row => row.toSeq.map(col => if (col == null) 0 else col.hashCode))
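To check that the two variants agree without starting a Spark job, here is a minimal sketch in plain Scala. The object name NullSafeHash and its two helper methods are hypothetical, and the Seq passed in stands in for row.toSeq; neither helper is part of Spark.

```scala
import java.util.Objects

// Hypothetical helpers for illustration only; not part of Spark.
object NullSafeHash {
  // Java 7+ variant: Objects.hashCode returns 0 when its argument is null.
  def hashColumns(columns: Seq[Any]): Seq[Int] =
    columns.map(col => Objects.hashCode(col))

  // Explicit null check, usable on older JVMs.
  def hashColumnsExplicit(columns: Seq[Any]): Seq[Int] =
    columns.map(col => if (col == null) 0 else col.hashCode)
}
```

Calling either helper on a sequence such as Seq("a", 1, null) should give a 0 in the position of the null instead of throwing.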