Encoder for Row Type Spark Datasets

I would like to write an encoder for a Row type in DataSet, for a map operation that I am doing. Essentially, I do not understand how to write encoders.

Below is an example of a map operation:

In the example below, instead of returning Dataset<String>, I would like to return Dataset<Row>

    Dataset<String> output = dataset1.flatMap(new FlatMapFunction<Row, String>() {
        @Override
        public Iterator<String> call(Row row) throws Exception {
            ArrayList<String> obj = // some map operation
            return obj.iterator();
        }
    }, Encoders.STRING());

I understand that instead of a string Encoder needs to be written as follows:

    Encoder<Row> encoder = new Encoder<Row>() {
        @Override
        public StructType schema() {
            return join.schema();
        }

        @Override
        public ClassTag<Row> clsTag() {
            return null;
        }
    };

However, I do not understand the clsTag() method in the encoder, and I am trying to find a running example that demonstrates something similar (i.e. an encoder for a Row type).
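For what it's worth, a ClassTag for Row can be obtained from Java through Scala's ClassTag$ companion object. Below is a minimal sketch of the anonymous encoder above with a real ClassTag instead of null; this is illustrative only, since a hand-written Encoder like this is generally not enough on its own (Spark internally expects the serializer/deserializer machinery of an ExpressionEncoder):

    import scala.reflect.ClassTag;
    import scala.reflect.ClassTag$;

    Encoder<Row> encoder = new Encoder<Row>() {
        @Override
        public StructType schema() {
            return join.schema();
        }

        @Override
        public ClassTag<Row> clsTag() {
            // Scala's ClassTag companion object, callable from Java
            return ClassTag$.MODULE$.apply(Row.class);
        }
    };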

Edit - This is not a duplicate of the question mentioned: Encoder error while trying to map dataframe row to updated row. The answer there is about using Spark 1.x APIs in Spark 2.x (I am not doing so), and I am looking for an encoder for the Row class rather than a fix for an error. Finally, I was looking for a solution in Java, not in Scala.

asked Apr 05 '17 by tsar2512

People also ask

What is row encoder?

RowEncoder is part of the Encoder framework and acts as the encoder for DataFrames, i.e. Dataset[Row], Datasets of Rows. Note that the DataFrame type is a mere type alias for Dataset[Row] that expects an Encoder[Row] to be available in scope, which is indeed RowEncoder itself.
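A minimal Java sketch of getting that Encoder[Row] for a given schema (the column names here are made up):

    // RowEncoder supplies the Encoder<Row> behind every Dataset<Row> / DataFrame
    StructType schema = new StructType()
        .add("name", DataTypes.StringType, true)
        .add("age", DataTypes.IntegerType, true);

    ExpressionEncoder<Row> rowEncoder = RowEncoder.apply(schema);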

What is Dataset row in Spark?

A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. Operations available on Datasets are divided into transformations and actions.
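In Java that distinction looks like this (file path and column name are placeholders):

    Dataset<Row> df = spark.read().json("people.json");            // a DataFrame, i.e. Dataset<Row>
    Dataset<Row> adults = df.filter(functions.col("age").geq(18)); // transformation: lazy
    long numAdults = adults.count();                               // action: triggers execution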

What is an encoder in Spark?

Encoders are part of Spark's Tungsten framework. Because the encoded data is backed by raw memory, updating or querying the relevant information from the encoded binary representation is done via Java Unsafe APIs. Spark provides a generic Encoder interface and a generic implementation of that interface called ExpressionEncoder.
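For instance, the Encoders factory returns ExpressionEncoder instances for common cases (Person here is a hypothetical JavaBean):

    Encoder<String> stringEncoder = Encoders.STRING();         // built-in encoder for a primitive type
    Encoder<Person> beanEncoder = Encoders.bean(Person.class); // encoder derived from JavaBean reflection
    Dataset<Person> people = df.as(beanEncoder);               // typed view over an untyped DataFrame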


2 Answers

The answer is to use a RowEncoder together with the dataset's schema, expressed as a StructType.

Below is a working example of a flatMap operation on Datasets:

    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
    import org.apache.spark.sql.catalyst.encoders.RowEncoder;

    StructType structType = new StructType();
    structType = structType.add("id1", DataTypes.LongType, false);
    structType = structType.add("id2", DataTypes.LongType, false);

    ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);

    Dataset<Row> output = join.flatMap(new FlatMapFunction<Row, Row>() {
        @Override
        public Iterator<Row> call(Row row) throws Exception {
            // a static map operation to demonstrate
            List<Object> data = new ArrayList<>();
            data.add(1L);
            data.add(2L);
            ArrayList<Row> list = new ArrayList<>();
            list.add(RowFactory.create(data.toArray()));
            return list.iterator();
        }
    }, encoder);
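A note for newer Spark releases: the RowEncoder.apply(StructType) entry point has moved around across versions; as far as I know, from Spark 3.5 onward the public way to get a Row encoder is Encoders.row:

    // Spark 3.5+ (assumed; check your version's API docs)
    Encoder<Row> encoder = Encoders.row(structType);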
answered Sep 22 '22 by tsar2512


I had the same problem... Encoders.kryo(Row.class) worked for me.

As a bonus, the Apache Spark tuning docs recommend Kryo since it is faster at serialization, "often as much as 10x":

https://spark.apache.org/docs/latest/tuning.html
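A sketch of wiring that into the question's flatMap. One caveat, as far as I can tell: a Kryo-encoded Dataset<Row> is stored as a single opaque binary column rather than with the row's real schema, so column-level operations on the result won't work:

    Encoder<Row> kryoEncoder = Encoders.kryo(Row.class);

    Dataset<Row> output = join.flatMap(
        (FlatMapFunction<Row, Row>) row -> {
            ArrayList<Row> list = new ArrayList<>();
            list.add(row); // identity mapping, just to demonstrate
            return list.iterator();
        },
        kryoEncoder);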

answered Sep 24 '22 by Jim Bob