How to avoid boxing bytes in array in custom datasource?

I'm working on a custom Spark datasource and want the schema to include a column of primitive byte array type.

My problem is that the bytes in the resulting byte array get boxed: the output has type WrappedArray$ofRef, meaning each byte is represented as a java.lang.Object. Although I can work around this, I'm concerned about the computational and memory overhead, which is critical for my application. I really just want primitive arrays!

Below is a minimal example which demonstrates this behaviour.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

class DefaultSource extends SchemaRelationProvider with DataSourceRegister {

    override def shortName(): String = "..."

    override def createRelation(
                                    sqlContext: SQLContext,
                                    parameters: Map[String, String],
                                    schema: StructType = new StructType()
                               ): BaseRelation = {
        new DefaultRelation(sqlContext)
    }
}

class DefaultRelation(val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {

    override def schema = {
        StructType(
            Array(
                StructField("key", ArrayType(ByteType))
            )
        )
    }

    override def buildScan(
                              requiredColumnNames: Array[String],
                              filterArr: Array[Filter]
                          ): RDD[Row] = {
        testRDD
    }

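    // A single Row holding a primitive byte array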
    def testRDD = sqlContext.sparkContext.parallelize(
        List(
            Row(
                Array[Byte](1)
            )
        )
    )
}
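As an aside, for the .format("testdatasource") call in the snippet below to resolve the source by its short name, the class is typically registered for Java's ServiceLoader; a sketch of the registration file, assuming shortName() returns "testdatasource" and a hypothetical com.example package:

    # src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
    com.example.DefaultSource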

Using this sample datasource as follows:

import org.apache.spark.sql.types.{ArrayType, ByteType, StructField, StructType}

def schema = StructType(Array(StructField("key", ArrayType(ByteType))))
val rows = sqlContext
        .read
        .schema(schema)
        .format("testdatasource")
        .load
        .collect()
println(rows(0)(0).getClass)

This generates the following output:

class scala.collection.mutable.WrappedArray$ofRef

Inspecting the result type further in the debugger confirms that the bytes in the WrappedArray are indeed boxed - and for some reason the element type is erased all the way up to java.lang.Object (rather than to java.lang.Byte).
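This can also be checked without a debugger; a minimal sketch, reusing the rows value from the snippet above, that prints the component type of the array backing the WrappedArray:

    // The WrappedArray wraps an Object[] rather than a byte[]
    val wrapped = rows(0)(0).asInstanceOf[scala.collection.mutable.WrappedArray[_]]
    println(wrapped.array.getClass) // prints: class [Ljava.lang.Object;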

Note that using an RDD directly, without going through the datasource APIs, leads to the expected result of primitive byte arrays.
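For comparison, a sketch of the direct-RDD case (assuming a plain RDD[Array[Byte]] is what is meant here):

    // No datasource API involved: the element type stays a primitive byte[]
    val rdd = sqlContext.sparkContext.parallelize(List(Array[Byte](1)))
    println(rdd.collect()(0).getClass) // prints: class [B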

Any suggestions as to how to solve this problem would be greatly appreciated.

Asked Jan 09 '17 by Michael Pedersen
1 Answer

OK, so for primitive byte arrays I should have used BinaryType instead of ArrayType(ByteType) as the column type. This fixes the problem.
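A minimal sketch of the fix against the relation above; only the schema changes, and buildScan can keep returning Row(Array[Byte](1)):

    // BinaryType is Spark SQL's type for primitive byte arrays
    override def schema = {
        StructType(
            Array(
                StructField("key", BinaryType)
            )
        )
    }

With this schema, rows(0)(0).getClass in the earlier snippet prints class [B, i.e. a primitive byte array.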

Out of curiosity, if we change ArrayType(ByteType) to e.g. ArrayType(LongType) in the examples above, we actually get a runtime exception indicating that boxed longs are expected. So, it seems that primitives in Spark SQL arrays are always boxed.
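For reference, a sketch of that variant (the exact exception and message depend on the Spark version):

    // Swapping the element type to LongType...
    override def schema = {
        StructType(Array(StructField("key", ArrayType(LongType))))
    }

    // ...and returning Row(Array[Long](1L)) from buildScan fails at runtime
    // with an error indicating that boxed java.lang.Long elements are expected.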

Answered Oct 25 '22 by Michael Pedersen