I have a CSV file containing latitude and longitude columns that I'm loading into a DataFrame through a SQLContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .schema(customSchema)
  .load(inputFile)
CSV example
metro_code, resolved_lat, resolved_lon
602, 40.7201, -73.2001
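For reference, a customSchema matching these three columns might look roughly like this (just a sketch; the exact field types are an assumption, since the real schema isn't shown):
import org.apache.spark.sql.types._

// Illustrative schema for the example columns above; adjust types as needed
val customSchema = StructType(Seq(
  StructField("metro_code", IntegerType, nullable = true),
  StructField("resolved_lat", DoubleType, nullable = true),
  StructField("resolved_lon", DoubleType, nullable = true)
))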
I'm trying to figure out the best way to add a new column holding the GeoHex code for each row. Encoding the lat and long is easy with the geohex package, but I'm not sure whether I need to map over the RDD (I've seen parallelize used) or pass a function to withColumn, as some examples do.
Wrapping the required function in a UDF should do the trick:
import org.apache.spark.sql.functions.udf
import org.geohex.geohex4j.GeoHex
import sqlContext.implicits._ // needed for toDF and the $"col" syntax

// Sample data standing in for the CSV-loaded DataFrame
val df = sc.parallelize(Seq(
  (Some(602), 40.7201, -73.2001), (None, 5.7805, 139.5703)
)).toDF("metro_code", "resolved_lat", "resolved_lon")
// Build a UDF for a fixed GeoHex precision level
def geoEncode(level: Int) = udf(
  (lat: Double, long: Double) => GeoHex.encode(lat, long, level))

df.withColumn("code", geoEncode(9)($"resolved_lat", $"resolved_lon")).show()
// +----------+------------+------------+-----------+
// |metro_code|resolved_lat|resolved_lon| code|
// +----------+------------+------------+-----------+
// | 602| 40.7201| -73.2001|PF384076026|
// | null| 5.7805| 139.5703|PR081331784|
// +----------+------------+------------+-----------+
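If the coordinate columns can contain nulls in your real data, it may be worth guarding the UDF call so GeoHex.encode is only invoked when both values are present (a sketch using when; rows with a missing coordinate simply get a null code):
import org.apache.spark.sql.functions.when

// Encode only when both coordinates are present; other rows get null
df.withColumn("code",
  when($"resolved_lat".isNotNull && $"resolved_lon".isNotNull,
    geoEncode(9)($"resolved_lat", $"resolved_lon")))
  .show()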