I've got two parquet files: one contains an integer field myField, and the other contains a double field myField. When attempting to read both files at once:
val basePath = "/path/to/file/"
val fileWithInt = basePath + "intFile.snappy.parquet"
val fileWithDouble = basePath + "doubleFile.snappy.parquet"
val result = spark.sqlContext.read
  .option("mergeSchema", true)
  .option("basePath", basePath)
  .parquet(Seq(fileWithInt, fileWithDouble): _*)
  .select("myField")
I get the following error:
Caused by: org.apache.spark.SparkException: Failed to merge fields 'myField' and 'myField'. Failed to merge incompatible data types IntegerType and DoubleType
When passing an explicit schema:
val schema = StructType(Seq(StructField("myField", IntegerType)))
val result = spark.sqlContext.read
  .schema(schema)
  .option("mergeSchema", true)
  .option("basePath", basePath)
  .parquet(Seq(fileWithInt, fileWithDouble): _*)
  .select("myField")
it fails with the following:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
When instead widening the explicit schema to a double:
val schema = StructType(Seq(new StructField("myField", DoubleType)))
I get:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at org.apache.parquet.column.Dictionary.decodeToDouble(Dictionary.java:60)
Does anyone know of a way around this problem other than reprocessing the source data?
Parquet data is immutable, which is the underlying problem whenever the data needs editing: you can add partitions to a Parquet dataset, but you can't edit the data in place.
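As an illustration, new rows can only be appended alongside the existing files. A minimal sketch, reusing the basePath from the question and a hypothetical directory of new rows:
val updates = spark.read.parquet("/path/to/new/rows") // hypothetical source of new rows
updates.write
  .mode("append")   // writes additional files; existing files are untouched
  .parquet(basePath)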
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas.
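Schema merging in this sense works when files differ only by added columns. A minimal sketch with hypothetical paths, where fileA holds (id: Int) and fileB holds (id: Int, extra: String):
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/path/to/fileA.parquet", "/path/to/fileB.parquet")
merged.printSchema() // shows both id and extra; no column changes type
It fails in the question above because myField changes type between files, which mergeSchema cannot reconcile.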
Depending on the number of files you are going to read, you can use one of these two approaches.
This one works best for a smaller number of parquet files:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.DoubleType

def merge(spark: SparkSession, paths: Seq[String]): DataFrame = {
  import spark.implicits._
  // Read each file separately, widen myField to double, then union the results
  paths.par.map { path =>
    spark.read.parquet(path).withColumn("myField", $"myField".cast(DoubleType))
  }.reduce(_.union(_))
}
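For example, with the paths from the question (a sketch, assuming the variables defined earlier):
val result = merge(spark, Seq(fileWithInt, fileWithDouble)).select("myField")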
This approach is better for processing a large number of files, since unioning once at the RDD level keeps the lineage short:
// Uses the same imports as merge above.
def merge2(spark: SparkSession, paths: Seq[String]): DataFrame = {
  import spark.implicits._
  // Dropping to RDD[Double] lets SparkContext.union combine all the files in a
  // single step. Assumes each file holds only the myField column.
  spark.sparkContext.union(paths.par.map { path =>
    spark.read.parquet(path)
      .withColumn("myField", $"myField".cast(DoubleType))
      .as[Double]
      .rdd
  }.toList).toDF("myField") // restore the column name lost in the RDD round-trip
}
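Usage is the same (a sketch with the question's paths):
val result = merge2(spark, Seq(fileWithInt, fileWithDouble))
result.printSchema() // myField: double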