I'm using the Spark Java API to read a text file, convert it to JSON, and then apply a schema to it. The schema can vary based on a mapping table in the database, which is why I need to convert the file to JSON first, so that the schema mapping does not have to be in column order. Here is what I've done:
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Defined the schema (basic representation)
StructType myschema = new StructType()
        .add("a", DataTypes.StringType, true)
        .add("b", DataTypes.StringType, true)
        .add("x", DataTypes.StringType, true)
        .add("y", DataTypes.IntegerType, true)
        .add("z", DataTypes.BooleanType, true);
// Reading a pipe-delimited text file as JSON; the file has fewer columns than myschema
Dataset<String> data = spark.read().option("delimiter","|").option("header","true").csv(myFile).toJSON();
The resulting Dataset looks something like this:
data.show(false);
+------------------------------------+
|value                               |
+------------------------------------+
|{"x":"name1","z":"true","y":"1234"} |
|{"x":"name2","z":"false","y":"1445"}|
|{"x":"name3","z":"true","y":"1212"} |
+------------------------------------+
My issue comes when I run this:
Dataset<Row> data_with_schema = spark.read().schema(myschema).json(data);
Because my result turns into this:
data_with_schema.show(false);
+----+----+----+
|x   |y   |z   |
+----+----+----+
|null|null|null|
|null|null|null|
|null|null|null|
+----+----+----+
I read on Stack Overflow that this might be because I'm trying to cast JSON strings to integers. I also tried defining the data variable as a Dataset<Row> instead of a Dataset<String>, but that produced an incompatible-types error. I'm not sure what the workaround is or what the real issue is.
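In case it helps, the fallback I'm considering (a rough sketch of my own, reusing spark and myFile from the snippet above) is to skip the JSON detour, keep every CSV column as a string, and cast the typed columns by name afterwards:

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;

// Read the file with header names only; every column starts as a string
Dataset<Row> raw = spark.read()
        .option("delimiter", "|")
        .option("header", "true")
        .csv(myFile);

// Cast the typed columns by name; a value that cannot be cast becomes
// null in that one cell rather than failing the whole record
Dataset<Row> typed = raw
        .withColumn("y", col("y").cast(DataTypes.IntegerType))
        .withColumn("z", col("z").cast(DataTypes.BooleanType));

Since the columns are addressed by name, the mapping still doesn't depend on column order.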
Figured out the problem:
If there is data in the input file that cannot have the schema applied to it, Spark returns null for ALL the data in the table. For example, "1n" is impossible to convert to an integer, so if DataTypes.IntegerType is applied to the column that contains "1n", the whole table will have null values.
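For anyone who hits this: as far as I can tell, this comes from Spark's default PERMISSIVE parse mode, which nulls out any record it cannot parse against the schema. Here is a minimal sketch of how to see which rows are failing (schemaWithCorrupt is my own name; _corrupt_record is Spark's default corrupt-record column, and myschema and data are from the question above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Add a string column to the schema to capture records that fail to parse
StructType schemaWithCorrupt = myschema.add("_corrupt_record", DataTypes.StringType, true);

Dataset<Row> parsed = spark.read()
        .schema(schemaWithCorrupt)
        .option("mode", "PERMISSIVE")                           // default: null out bad records
        .option("columnNameOfCorruptRecord", "_corrupt_record") // keep the raw text instead
        .json(data);

// Rows where parsing failed keep their original JSON text here
parsed.filter(parsed.col("_corrupt_record").isNotNull()).show(false);

Alternatively, .option("mode", "FAILFAST") makes the read throw on the first malformed record instead of silently nulling it.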