
Spark Dataframe Returning NULL when specifying a Schema

I am converting a JavaRDD<String> (where each string is a JSON record) into a DataFrame and displaying it. I am doing something like the following:

public void call(JavaRDD<String> rdd, Time time) throws Exception {
    if (rdd.count() > 0) {
        JavaRDD<String> filteredRDD = rdd.filter(x -> x.length() > 0);
        sqlContext = SQLContextSingleton.getInstance(filteredRDD.context());
        DataFrame df = sqlContext.read().schema(SchemaBuilder.buildSchema()).json(filteredRDD);
        df.show();
    }
}

The schema is as follows:

public static StructType buildSchema() {
    StructType schema = new StructType(
            new StructField[] { DataTypes.createStructField("student_id", DataTypes.StringType, false),
                    DataTypes.createStructField("school_id", DataTypes.IntegerType, false),
                    DataTypes.createStructField("teacher", DataTypes.StringType, true),
                    DataTypes.createStructField("rank", DataTypes.StringType, true),
                    DataTypes.createStructField("created", DataTypes.TimestampType, true),
                    DataTypes.createStructField("created_user", DataTypes.StringType, true),
                    DataTypes.createStructField("notes", DataTypes.StringType, true),
                    DataTypes.createStructField("additional_data", DataTypes.StringType, true),
                    DataTypes.createStructField("datetime", DataTypes.TimestampType, true) });
    return schema;
}

The above code returns:

+----------+---------+-------+----+-------+------------+-----+---------------+--------+
|student_id|school_id|teacher|rank|created|created_user|notes|additional_data|datetime|
+----------+---------+-------+----+-------+------------+-----+---------------+--------+
|      null|     null|   null|null|   null|        null| null|           null|    null|
+----------+---------+-------+----+-------+------------+-----+---------------+--------+

But when I don't specify the schema and create the DataFrame as:

DataFrame df = sqlContext.read().json(filteredRDD);

it returns the expected result:

+----------+---------+-------+----+-----------------------+------------+-----+---------------+-----------------------+
|student_id|school_id|teacher|rank|                created|created_user|notes|additional_data|               datetime|
+----------+---------+-------+----+-----------------------+------------+-----+---------------+-----------------------+
|         1|      123|    xxx|   3|2017-06-02 23:49:10.410|        yyyy| NULL| good academics|2017-06-02 23:49:10.410|
+----------+---------+-------+----+-----------------------+------------+-----+---------------+-----------------------+

Sample JSON Record:

{"student_id": "1","school_id": "123","teacher": "xxx","rank": "3","created": "2017-06-02 23:49:10.410","created_user":"yyyy","notes": "NULL","additional_date":"good academics","datetime": "2017-06-02 23:49:10.410"}

Any help on what I am doing wrong?

asked Sep 06 '17 by Cheater


1 Answer

The problem is that in my JSON record, school_id is a string (it is quoted: "school_id": "123"), and Spark will not implicitly convert a String to an Integer. When a field fails to match the declared type, Spark treats the entire record as corrupt and returns null for every column. I modified my schema to declare school_id as StringType, which resolved the problem. A good explanation of this behavior is provided at: http://blog.antlypls.com/blog/2016/01/30/processing-json-data-with-sparksql/
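The failure mode can be illustrated with a short, self-contained sketch. This is plain Java, not Spark's actual internals; `convertToInteger` is a hypothetical helper that only mimics the strict conversion idea: a quoted JSON value parses as a string, and if the schema declares an integer, the conversion fails and (in Spark's default mode) the whole row comes back null.

```java
// Simplified illustration (NOT Spark's real code) of why a quoted JSON
// value like "school_id": "123" cannot satisfy an IntegerType field.
public class SchemaMismatchDemo {

    // Hypothetical helper mimicking a strict JSON-to-schema conversion:
    // only an actual JSON number satisfies an integer field; a quoted
    // string fails the conversion, which Spark surfaces as a null row.
    public static Integer convertToInteger(Object jsonValue) {
        if (jsonValue instanceof Integer) {
            return (Integer) jsonValue; // JSON number 123 -> fine
        }
        return null; // JSON string "123" -> type mismatch, row nulled
    }

    public static void main(String[] args) {
        // "school_id": "123" -- quoted, so the JSON parser yields a String
        System.out.println(convertToInteger("123")); // prints null
        // "school_id": 123   -- unquoted, the parser yields an Integer
        System.out.println(convertToInteger(123));   // prints 123
    }
}
```

So the two fixes are equivalent in spirit: either change the schema to declare school_id as StringType (as in the answer above), or change the producer to emit the field as an unquoted JSON number.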

answered Oct 15 '22 by Cheater