Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Exception with Table identified via AWS Glue Crawler and stored in Data Catalog

I'm working to build the new data lake of the company and are trying to find the best and the most recent option to work here. So, I found a pretty nice solution to work with EMR + S3 + Athena + Glue.

The process that I did was:

1 - Run Apache Spark script to generate 30 millions rows partitioned by date at S3 stored by Orc.

2 - Run a Athena query to create the external table.

3 - Checked the table at EMR connected with Glue Data Catalog and it worked perfect. Both Spark and Hive were able to access.

4 - Generate another 30 millions rows in other folder partitioned by date. In Orc format

5 - Ran the Glue Crawler that identify the new table. Added to Data Catalog and Athena was able to do the query. But Spark and Hive aren't able to do it. See the exception below:

Spark Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcStruct

Hive Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating audit_id (state=,code=0)

I was checking if was any serialisation problem and I found this:

Table created manually (Configuration):

Input format org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

Output format org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat

Serde serialization lib org.apache.hadoop.hive.ql.io.orc.OrcSerde

orc.compress SNAPPY

Table Created with Glue Crawler:

Input format org.apache.hadoop.mapred.TextInputFormat

Output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Serde serialization lib org.apache.hadoop.hive.ql.io.orc.OrcSerde

So, this is not working to read from Hive or Spark. It works for Athena. I already changed the configurations but with no effect at Hive or Spark.

Anyone faced that problem?

like image 269
Thiago Baldim Avatar asked Aug 18 '17 04:08

Thiago Baldim


People also ask

How does AWS Glue handle ETL errors?

Q: How does AWS Glue handle ETL errors? AWS Glue monitors job event metrics and errors, and pushes all notifications to Amazon CloudWatch. With Amazon CloudWatch, you can configure a host of actions that can be triggered based on specific notifications from AWS Glue.

How does crawler work in AWS Glue?

A crawler accesses your data store, extracts metadata, and creates table definitions in the AWS Glue Data Catalog. The Crawlers pane in the AWS Glue console lists all the crawlers that you create. The list displays status and metrics from the last run of your crawler.

Which type of data can be discover by AWS Glue?

With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes.

How does crawler detect schema?

When the crawler runs, the crawler uses the custom classifier that you defined to find a match in the data store. The match with each classifier generates a certainty. If the classifier returns certainty=1.0 during processing, then the crawler is 100 percent certain that the classifier can create the correct schema.


1 Answers

Well,

After few weeks that I posted this question AWS fixed the problem. As I showed above, the problem was real and that was a bug from Glue.

As it is a new product and still have some problems some times.

But this was solved properly. See the properties of the table now:

ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
like image 196
Thiago Baldim Avatar answered Oct 01 '22 00:10

Thiago Baldim