I'm testing a prototype application. We have JSON data with nested fields, and I'm trying to pull a field out using the following JSON and code:
Feed: {name: "test",[Record: {id: 1 AllColumns: {ColA: "1",ColB: "2"}}...]}
Dataset<Row> completeRecord = sparkSession.read().json(inputPath);
final Dataset<Row> feed = completeRecord.select(completeRecord.col("Feed.Record.AllColumns"));
I have around 2000 files with such records. I have tested several files individually and they work fine, but for some files I get the error below on the second line:
org.apache.spark.sql.AnalysisException: Can't extract value from Feed#8.Record: need struct type but got string;
I'm not sure what is going on here, but I would like to handle this error gracefully and log which file has that record. Also, is there any way to ignore this and continue with the rest of the files?
Answering my own question based on what I have learned: there are a couple of ways to solve this. Spark provides options to ignore corrupt files and corrupt records.
To ignore corrupt files, one can set the following flag to true:
spark.sql.files.ignoreCorruptFiles=true
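For example, the flag can be set on the session's runtime config before reading (a minimal sketch; sparkSession and inputPath are the names from the question above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Skip files that cannot be read at all and continue with the remaining files.
sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true");
Dataset<Row> completeRecord = sparkSession.read().json(inputPath);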
For more fine-grained control, and to ignore bad records instead of ignoring the complete file, you can use one of the three modes that the Spark API provides. According to the DataFrameReader API:
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
- PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When a schema is set by user, it sets null for extra fields.
- DROPMALFORMED: ignores the whole corrupted records.
- FAILFAST: throws an exception when it meets corrupted records.
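As a sketch, the mode is passed as a read option on DataFrameReader (sparkSession and inputPath are again just the names from the question; "_corrupt_record" is the default corrupt-record column name):

Dataset<Row> completeRecord = sparkSession.read()
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .json(inputPath);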
PERMISSIVE mode worked really well for me, but when I provided my own schema, Spark filled the missing attributes with null instead of marking them as corrupt records.
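To also log which file produced a bad record (the other part of my question), one possible approach is to combine the corrupt-record column with input_file_name(). This is only a sketch under the assumption that no explicit schema is supplied, so Spark still populates _corrupt_record for malformed rows; the column names are the defaults/placeholders:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.input_file_name;

Dataset<Row> parsed = sparkSession.read()
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .json(inputPath)
        .withColumn("source_file", input_file_name());

// Cache before filtering: recent Spark versions disallow queries that reference
// only the internal corrupt-record column of a raw JSON source.
parsed.cache();

// Log which files contain malformed records, then continue with the clean rows.
parsed.filter(col("_corrupt_record").isNotNull())
        .select("source_file")
        .distinct()
        .show(false);

Dataset<Row> clean = parsed.filter(col("_corrupt_record").isNull());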