SPARK dataframe returning null when trying to apply schema to JSON data

Question

I'm using the SPARK Java API to read a text file, convert it to JSON, and then apply a schema to it. The schema can vary based on a mapping table in the database, which is why I need to first convert the file to JSON so the schema mapping does not have to be in column order. Here is what I've done:

// Defined the schema (basic representation)
StructType myschema = new StructType().add("a", DataTypes.StringType, true)
                                      .add("b", DataTypes.StringType, true)
                                      .add("x", DataTypes.StringType, true)
                                      .add("y", DataTypes.IntegerType, true)
                                      .add("z", DataTypes.BooleanType, true);

//Reading a pipe delimited text file as JSON, the file has less columns than myschema
Dataset<String> data = spark.read().option("delimiter","|").option("header","true").csv(myFile).toJSON();

The above table returns something like this:

data.show(false);

|value|
+----------------------------------------+
|      {"x":"name1","z":"true","y":"1234"}|
|      {"x":"name2","z":"false","y":"1445"}|
|      {"x":"name3","z":"true",:y":"1212"}|

My issue comes when I run this:

Dataset<Row> data_with_schema = spark.read().schema(myschema).json(data);

Because my result turns into this:

data_with_schema.show(false);
|x|y|z|
+-------+-------+-------+
|null  |null  |null  |
|null  |null  |null  |
|null  |null  |null  |

I read on stackoverflow that this might be because I'm trying to cast json strings as integers. However, I tried to define the data variable as a Row Dataset instead of String Dataset but there was an Incompatible Types error. I'm not sure what the workaround is or what the real issue is.

Mitchan · Accepted Answer

Figured out the problem:

If there is data in the inputted file that cannot have a schema applied to it, it will return Null for ALL the data in your table. For example: "1n" is impossible to convert to integer. If a DataTypes.IntegerType is applied to the column that contains "1n", then the whole table with have null values.

SPARK dataframe returning null when trying to apply schema to JSON data

Tags:

java

json

dataframe

schema

apache-spark

Mitchan

1 Answers

Mitchan

Recent Activity

Donate For Us

SPARK dataframe returning null when trying to apply schema to JSON data

Tags:

java

json

dataframe

schema

apache-spark

Mitchan

1 Answers

Mitchan

Related questions

Recent Activity

Donate For Us