Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scala Spark - creating nested json output from simple dataframe

Thanks for getting back. But the problem I am facing is while writing those structs into nested json. Somehow 'tojson' is not working and is just skipping the nested fields resulting into a flat structure always. How can I write into nested json format into HDFS?

like image 420
Subhadip Avatar asked Oct 30 '22 00:10

Subhadip


1 Answers

You should be creating the struct fields from the fields which have to be be nested together. Below is an working example : Assume you have a employee data in csv format containing companyname,employee and department name and you would want to list all the employees per department per company in json format. Below is the code for the same.

  import java.util.List;
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.RowFactory;
  import org.apache.spark.sql.SparkSession;
  import org.apache.spark.sql.api.java.UDF2;
  import org.apache.spark.sql.types.DataTypes;
  import org.apache.spark.sql.types.StructField;

  import scala.collection.mutable.WrappedArray;
public class JsonExample {
public static void main(String [] args)
 {
    SparkSession sparkSession = SparkSession
              .builder()
              .appName("JsonExample")
              .master("local")
              .getOrCreate();

    //read the csv file
    Dataset<Row> employees = sparkSession.read().option("header", "true").csv("/tmp/data/emp.csv");
    //create the temp view
    employees.createOrReplaceTempView("employees");

    //First , group the employees based on company AND department 
    sparkSession.sql("select company,department,collect_list(name) as department_employees from employees group by company,department").createOrReplaceTempView("employees");
    /*Now create a struct by invoking the UDF create_struct. 
     * The struct will contain department and the list of employees 
    */
    sparkSession.sql("select company,collect_list(struct(department,department_employees)) as department_info from employees group by company").toJSON().show(false);



 }
}

You can find the same example on my blog: http://baahu.in/spark-how-to-generate-nested-json-using-dataset/

like image 183
BJC Avatar answered Nov 15 '22 05:11

BJC