I'm testing a prototype application. We have JSON data with nested fields, and I'm trying to pull a field out using the following JSON and code:
Feed: {name: "test",[Record: {id: 1 AllColumns: {ColA: "1",ColB: "2"}}...]}
Dataset<Row> completeRecord = sparkSession.read().json(inputPath);
final Dataset<Row> feed = completeRecord.select(completeRecord.col("Feed.Record.AllColumns"));
I have around 2000 files with such records. I have tested several files individually and they work fine, but for some files I get the error below on the second line:
org.apache.spark.sql.AnalysisException: Can't extract value from Feed#8.Record: need struct type but got string;
I'm not sure what is going on here, but I would like to handle this error gracefully and log which file has that record. Also, is there any way to ignore this and continue with the rest of the files?
Answering my own question based on what I have learned: there are a couple of ways to solve this. Spark provides options to ignore corrupt files and corrupt records.
To ignore corrupt files, one can set the following flag to true:
spark.sql.files.ignoreCorruptFiles=true
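For example, the flag can be set on the session's runtime config before reading (a minimal sketch; sparkSession and inputPath are the names from the question above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Skip files that cannot be read at all and continue with the remaining files.
sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true");
Dataset<Row> completeRecord = sparkSession.read().json(inputPath);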
For more fine-grained control, and to ignore bad records instead of ignoring the complete file, you can use one of the three modes that the Spark API provides. According to the DataFrameReader API:
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
- PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When a schema is set by user, it sets null for extra fields.
- DROPMALFORMED: ignores the whole corrupted records.
- FAILFAST: throws an exception when it meets corrupted records.
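As a sketch, the mode is passed as a read option on DataFrameReader (sparkSession and inputPath are again just the names from the question; "_corrupt_record" is the default corrupt-record column name):

Dataset<Row> completeRecord = sparkSession.read()
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .json(inputPath);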
PERMISSIVE mode worked really well for me, but when I provided my own schema, Spark filled the missing attributes with null instead of marking them as corrupt records.
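To also log which file produced a bad record (the other part of my question), one possible approach is to combine the corrupt-record column with input_file_name(). This is only a sketch under the assumption that no explicit schema is supplied, so Spark still populates _corrupt_record for malformed rows; the column names are the defaults/placeholders:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.input_file_name;

Dataset<Row> parsed = sparkSession.read()
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .json(inputPath)
        .withColumn("source_file", input_file_name());

// Cache before filtering: recent Spark versions disallow queries that reference
// only the internal corrupt-record column of a raw JSON source.
parsed.cache();

// Log which files contain malformed records, then continue with the clean rows.
parsed.filter(col("_corrupt_record").isNotNull())
        .select("source_file")
        .distinct()
        .show(false);

Dataset<Row> clean = parsed.filter(col("_corrupt_record").isNull());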