I'm trying to use a UDF with an input type of Array of struct. I have the following structure of data (this is only the relevant part of a bigger structure):
|-- investments: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- funding_round: struct (nullable = true)
| | | |-- company: struct (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- permalink: string (nullable = true)
| | | |-- funded_day: long (nullable = true)
| | | |-- funded_month: long (nullable = true)
| | | |-- funded_year: long (nullable = true)
| | | |-- raised_amount: long (nullable = true)
| | | |-- raised_currency_code: string (nullable = true)
| | | |-- round_code: string (nullable = true)
| | | |-- source_description: string (nullable = true)
| | | |-- source_url: string (nullable = true)
I declared case classes:
case class Company(name: String, permalink: String)
case class FundingRound(
  company: Company,
  funded_day: Long,
  funded_month: Long,
  funded_year: Long,
  raised_amount: Long,
  raised_currency_code: String,
  round_code: String,
  source_description: String,
  source_url: String)
case class Investments(funding_round: FundingRound)
UDF declaration:
sqlContext.udf.register("total_funding", (investments: Seq[Investments]) => {
  val totals = investments.map(r => r.funding_round.raised_amount)
  totals.sum
})
When I execute the following transformation, the result is as expected:
scala> sqlContext.sql("""select total_funding(investments) from companies""")
res11: org.apache.spark.sql.DataFrame = [_c0: bigint]
But when an action such as collect is executed, I get an error:
Executor: Exception in task 0.0 in stage 4.0 (TID 10)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line33.$read$$iwC$$iwC$Investments
Thank you for any help.
The error you see should be pretty much self-explanatory. There is a strict mapping between Catalyst / SQL types and Scala types which can be found in the relevant section of the Spark SQL, DataFrames and Datasets Guide.
In particular, struct types are converted to o.a.s.sql.Row (in your particular case the data will be exposed as Seq[Row]).
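So the UDF from the question needs to accept Seq[Row] instead of Seq[Investments]. A minimal sketch of the corrected registration (it assumes every funding_round struct and its raised_amount are non-null; a safer, Try-guarded variant follows below):

import org.apache.spark.sql.Row

sqlContext.udf.register("total_funding", (investments: Seq[Row]) => {
  // Each array element arrives as a generic Row, not as an Investments instance,
  // so nested fields have to be extracted by name with getAs.
  investments.map(_.getAs[Row]("funding_round").getAs[Long]("raised_amount")).sum
})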
There are different methods which can be used to expose data as specific types: defining the UDF over Row objects directly, or converting the DataFrame to a Dataset[T], where T is the desired local type. Only the former approach is applicable in this particular scenario.
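For completeness, the latter approach would look roughly like this; a sketch assuming Spark 1.6+, the case classes defined at the top level rather than in the REPL line scope, and a hypothetical wrapper CompanyInvestments matching only the selected column:

import sqlContext.implicits._

// Hypothetical wrapper whose field name matches the selected column.
case class CompanyInvestments(investments: Seq[Investments])

val totals = df.select($"investments").as[CompanyInvestments]
  .map(c => c.investments.map(_.funding_round.raised_amount).sum)
// Note: null numeric fields in the data would fail here, since the
// case classes declare them as primitive Long.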
If you want to access investments.funding_round.raised_amount using a UDF you'll need something like this:
val getRaisedAmount = udf((investments: Seq[Row]) => scala.util.Try(
  // Each array element is a Row; drill into the nested struct by field name.
  // Try(...).toOption yields None instead of failing on malformed or null data.
  investments.map(_.getAs[Row]("funding_round").getAs[Long]("raised_amount"))
).toOption)
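Applied with the DataFrame API, for example:

df.select(getRaisedAmount($"investments").as("raised_amounts"))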
but a simple select should be much safer and cleaner:
df.select($"investments.funding_round.raised_amount")
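And if the goal is still a total per company, the extracted amounts can be aggregated with built-in functions instead of a UDF, for example via explode; a sketch assuming the companies table has an identifying column, hypothetically called name here:

import org.apache.spark.sql.functions._

// One output row per array element, then a plain aggregation.
df.select($"name", explode($"investments").as("inv"))
  .groupBy($"name")
  .agg(sum($"inv.funding_round.raised_amount").as("total_funding"))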