Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SPARK dataframe returning null when trying to apply schema to JSON data

I'm using the SPARK Java API to read a text file, convert it to JSON, and then apply a schema to it. The schema can vary based on a mapping table in the database, which is why I need to first convert the file to JSON so the schema mapping does not have to be in column order. Here is what I've done:

// Defined the schema (basic representation)
StructType myschema = new StructType().add("a", DataTypes.StringType, true)
                                      .add("b", DataTypes.StringType, true)
                                      .add("x", DataTypes.StringType, true)
                                      .add("y", DataTypes.IntegerType, true)
                                      .add("z", DataTypes.BooleanType, true);

//Reading a pipe delimited text file as JSON, the file has less columns than myschema
Dataset<String> data = spark.read().option("delimiter","|").option("header","true").csv(myFile).toJSON();

The above table returns something like this:

data.show(false);

|value|
+----------------------------------------+
|      {"x":"name1","z":"true","y":"1234"}|
|      {"x":"name2","z":"false","y":"1445"}|
|      {"x":"name3","z":"true",:y":"1212"}|

My issue comes when I run this:

Dataset<Row> data_with_schema = spark.read().schema(myschema).json(data);

Because my result turns into this:

data_with_schema.show(false);
|x|y|z|
+-------+-------+-------+
|null  |null  |null  |
|null  |null  |null  |
|null  |null  |null  |

I read on stackoverflow that this might be because I'm trying to cast json strings as integers. However, I tried to define the data variable as a Row Dataset instead of String Dataset but there was an Incompatible Types error. I'm not sure what the workaround is or what the real issue is.

like image 477
Mitchan Avatar asked Jan 26 '23 17:01

Mitchan


1 Answers

Figured out the problem:

If there is data in the inputted file that cannot have a schema applied to it, it will return Null for ALL the data in your table. For example: "1n" is impossible to convert to integer. If a DataTypes.IntegerType is applied to the column that contains "1n", then the whole table with have null values.

like image 74
Mitchan Avatar answered Jan 30 '23 05:01

Mitchan