I am trying to create a user-defined aggregate function (UDAF) in Java using Apache Spark SQL that returns multiple arrays on completion. I have searched online and cannot find any examples or suggestions on how to do this.
I am able to return a single array, but cannot figure out how to get the data in the correct format in the evaluate() method for returning multiple arrays.
The UDAF does work (I can print out the arrays in the evaluate() method); I just can't figure out how to return those arrays to the calling code, which is shown below for reference.
UserDefinedAggregateFunction customUDAF = new CustomUDAF();
DataFrame resultingDataFrame = dataFrame.groupBy().agg(
        customUDAF.apply(dataFrame.col("long_col"), dataFrame.col("double_col")).as("processed_data"));
I have included the whole custom UDAF class below; the key methods are dataType() and evaluate(), which appear first.
Any help or advice would be greatly appreciated. Thank you.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CustomUDAF extends UserDefinedAggregateFunction {

    @Override
    public DataType dataType() {
        // TODO: Is this the correct way to return 2 arrays?
        return new StructType()
                .add("longArray", DataTypes.createArrayType(DataTypes.LongType, false))
                .add("dataArray", DataTypes.createArrayType(DataTypes.DoubleType, false));
    }

    @Override
    public Object evaluate(Row buffer) {
        // Data conversion
        List<Long> longList = new ArrayList<Long>(buffer.getList(0));
        List<Double> dataList = new ArrayList<Double>(buffer.getList(1));
        // Processing of data (omitted)
        // TODO: How to get data into the format needed to return 2 arrays?
        return dataList;
    }

    @Override
    public StructType inputSchema() {
        return new StructType()
                .add("long", DataTypes.LongType)
                .add("data", DataTypes.DoubleType);
    }

    @Override
    public StructType bufferSchema() {
        return new StructType()
                .add("longArray", DataTypes.createArrayType(DataTypes.LongType, false))
                .add("dataArray", DataTypes.createArrayType(DataTypes.DoubleType, false));
    }

    @Override
    public void initialize(MutableAggregationBuffer buffer) {
        buffer.update(0, new ArrayList<Long>());
        buffer.update(1, new ArrayList<Double>());
    }

    @Override
    public void update(MutableAggregationBuffer buffer, Row row) {
        ArrayList<Long> longList = new ArrayList<Long>(buffer.getList(0));
        longList.add(row.getLong(0));
        ArrayList<Double> dataList = new ArrayList<Double>(buffer.getList(1));
        dataList.add(row.getDouble(1));
        buffer.update(0, longList);
        buffer.update(1, dataList);
    }

    @Override
    public void merge(MutableAggregationBuffer buffer1, Row buffer2) {
        ArrayList<Long> longList = new ArrayList<Long>(buffer1.getList(0));
        longList.addAll(buffer2.getList(0));
        ArrayList<Double> dataList = new ArrayList<Double>(buffer1.getList(1));
        dataList.addAll(buffer2.getList(1));
        buffer1.update(0, longList);
        buffer1.update(1, dataList);
    }

    @Override
    public boolean deterministic() {
        return true;
    }
}
Update: Based on the answer by zero323 I was able to return two arrays using:
return new Tuple2<>(longArray, dataArray);
Getting the data back out of the resulting DataFrame was a bit of a struggle, but it amounted to deconstructing the DataFrame into Java Lists and then building it back up into a DataFrame.
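For completeness, a minimal sketch of roughly what that looked like; the reworked evaluate() plus the extraction of the two arrays from the single struct column. The use of first() and the local variable names are illustrative, not the exact code:

import scala.Tuple2;

@Override
public Object evaluate(Row buffer) {
    List<Long> longList = new ArrayList<Long>(buffer.getList(0));
    List<Double> dataList = new ArrayList<Double>(buffer.getList(1));
    // Processing of data (omitted)
    // A Tuple2 is converted to a struct with two array fields, matching dataType()
    return new Tuple2<>(longList, dataList);
}

// Extracting the two arrays from the struct column of the result:
Row resultRow = resultingDataFrame.first();
Row struct = resultRow.getStruct(0);       // struct of (longArray, dataArray)
List<Long> longs = struct.getList(0);      // field "longArray"
List<Double> doubles = struct.getList(1);  // field "dataArray"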
As far as I can tell, returning a tuple should be just enough. In Scala:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.{Row, Column}
object DummyUDAF extends UserDefinedAggregateFunction {
  def inputSchema = new StructType().add("x", StringType)

  def bufferSchema = new StructType()
    .add("buff", ArrayType(LongType))
    .add("buff2", ArrayType(DoubleType))

  // A struct of two arrays: this is the multi-array return type
  def dataType = new StructType()
    .add("xs", ArrayType(LongType))
    .add("ys", ArrayType(DoubleType))

  def deterministic = true

  // No-ops: this dummy only demonstrates the return type
  def initialize(buffer: MutableAggregationBuffer) = {}
  def update(buffer: MutableAggregationBuffer, input: Row) = {}
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {}

  // Returning a Tuple2 produces the struct declared in dataType
  def evaluate(buffer: Row) = (Array(1L, 2L, 3L), Array(1.0, 2.0, 3.0))
}
val df = sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("k", "v")
df.select(DummyUDAF($"k")).show(1, false)
// +---------------------------------------------------+
// |(DummyUDAF$(k),mode=Complete,isDistinct=false) |
// +---------------------------------------------------+
// |[WrappedArray(1, 2, 3),WrappedArray(1.0, 2.0, 3.0)]|
// +---------------------------------------------------+
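Since the result comes back as a single struct column, the individual arrays can be pulled out by aliasing the column and selecting its fields by name. A small sketch under that assumption (the alias "r" is arbitrary):

val result = df.select(DummyUDAF($"k").alias("r"))
result.select($"r.xs", $"r.ys").show(false)
// xs: [1, 2, 3], ys: [1.0, 2.0, 3.0]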