I am trying to move data from Greenplum to HDFS using Spark. I can read the data from the source table successfully, and the Spark-inferred schema of the DataFrame (built from the Greenplum table) is:
DataFrame Schema:
je_header_id: long (nullable = true)
je_line_num: long (nullable = true)
last_updated_by: decimal(15,0) (nullable = true)
last_updated_by_name: string (nullable = true)
ledger_id: long (nullable = true)
code_combination_id: long (nullable = true)
balancing_segment: string (nullable = true)
cost_center_segment: string (nullable = true)
period_name: string (nullable = true)
effective_date: timestamp (nullable = true)
status: string (nullable = true)
creation_date: timestamp (nullable = true)
created_by: decimal(15,0) (nullable = true)
entered_dr: decimal(38,20) (nullable = true)
entered_cr: decimal(38,20) (nullable = true)
entered_amount: decimal(38,20) (nullable = true)
accounted_dr: decimal(38,20) (nullable = true)
accounted_cr: decimal(38,20) (nullable = true)
accounted_amount: decimal(38,20) (nullable = true)
xx_last_update_log_id: integer (nullable = true)
source_system_name: string (nullable = true)
period_year: decimal(15,0) (nullable = true)
period_num: decimal(15,0) (nullable = true)
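For context, a minimal sketch of how a table like this might be read from Greenplum over plain JDBC; the URL, credentials, and table name below are placeholders, not taken from the original post:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("gp-to-hdfs").getOrCreate()

// Greenplum speaks the PostgreSQL wire protocol, so a plain JDBC read works;
// the connection details here are illustrative only.
val sourceDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://gp-host:5432/mydb")
  .option("dbtable", "gl.je_lines")
  .option("user", "gpuser")
  .option("password", "secret")
  .load()

sourceDf.printSchema()   // prints the inferred schema shown above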
The corresponding schema of the Hive table is:
je_header_id:bigint|je_line_num:bigint|last_updated_by:bigint|last_updated_by_name:string|ledger_id:bigint|code_combination_id:bigint|balancing_segment:string|cost_center_segment:string|period_name:string|effective_date:timestamp|status:string|creation_date:timestamp|created_by:bigint|entered_dr:double|entered_cr:double|entered_amount:double|accounted_dr:double|accounted_cr:double|accounted_amount:double|xx_last_update_log_id:int|source_system_name:string|period_year:bigint|period_num:bigint
Using the Hive table schema mentioned above, I built the StructType below with the following conversion logic:
import org.apache.spark.sql.types._

// Maps a Hive type name to the corresponding Spark SQL DataType.
def convertDatatype(datatype: String): DataType = {
  datatype match {
    case "string"    => StringType
    case "bigint"    => LongType
    case "int"       => IntegerType
    case "double"    => DoubleType
    case "date"      => TimestampType
    case "boolean"   => BooleanType
    case "timestamp" => TimestampType
  }
}
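A sketch of how the pipe-delimited Hive schema string above can be turned into a StructType with this function (the variable names are illustrative):

import org.apache.spark.sql.types.{StructField, StructType}

// hiveSchemaStr is the "name:type|name:type|..." string shown above.
val newSchema = StructType(
  hiveSchemaStr.split("\\|").map { column =>
    val Array(name, hiveType) = column.split(":")
    StructField(name, convertDatatype(hiveType), nullable = true)
  }
)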
Prepared Schema:
je_header_id: long (nullable = true)
je_line_num: long (nullable = true)
last_updated_by: long (nullable = true)
last_updated_by_name: string (nullable = true)
ledger_id: long (nullable = true)
code_combination_id: long (nullable = true)
balancing_segment: string (nullable = true)
cost_center_segment: string (nullable = true)
period_name: string (nullable = true)
effective_date: timestamp (nullable = true)
status: string (nullable = true)
creation_date: timestamp (nullable = true)
created_by: long (nullable = true)
entered_dr: double (nullable = true)
entered_cr: double (nullable = true)
entered_amount: double (nullable = true)
accounted_dr: double (nullable = true)
accounted_cr: double (nullable = true)
accounted_amount: double (nullable = true)
xx_last_update_log_id: integer (nullable = true)
source_system_name: string (nullable = true)
period_year: long (nullable = true)
period_num: long (nullable = true)
When I try to apply my new schema to the DataFrame, I get an exception:
java.lang.RuntimeException: java.math.BigDecimal is not a valid external type for schema of bigint
I understand that it fails while trying to convert BigDecimal to bigint, but could anyone tell me how to cast the bigint to a Spark-compatible datatype?
If not, how can I modify my logic to produce the proper datatypes in the case statement for this bigint/BigDecimal problem?
From your question, it looks like you are trying to convert a bigint value to a BigDecimal, which is not right. BigDecimal is a decimal with a fixed precision (the maximum number of digits) and scale (the number of digits to the right of the decimal point), whereas your value looks like a plain long.
Instead of the BigDecimal datatype, use LongType to convert the bigint values correctly, and see if that solves your problem.
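As a sketch of that idea, assuming the Hive schema was parsed into a StructType called newSchema as in the question and the source DataFrame is sourceDf, cast each column to its Hive target type instead of re-applying a schema to the existing rows; the decimal(15,0) columns then become longs and the decimal(38,20) columns become doubles:

import org.apache.spark.sql.functions.col

// Cast every column of the Greenplum DataFrame to the type expected by Hive.
val castedDf = newSchema.fields.foldLeft(sourceDf) { (df, field) =>
  df.withColumn(field.name, col(field.name).cast(field.dataType))
}

// The table name here is illustrative.
castedDf.write.mode("append").insertInto("my_hive_table")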