I'm working on building my company's new data lake and trying to find the best and most up-to-date option for it. I settled on a solution built from EMR + S3 + Athena + Glue.
The process I followed was:
1 - Ran an Apache Spark script to generate 30 million rows, partitioned by date, stored in S3 as ORC.
2 - Ran an Athena query to create the external table.
3 - Checked the table from EMR connected to the Glue Data Catalog, and it worked perfectly. Both Spark and Hive were able to access it.
4 - Generated another 30 million rows in another folder, again partitioned by date and in ORC format.
5 - Ran the Glue crawler, which identified the new table and added it to the Data Catalog. Athena was able to query it, but Spark and Hive were not. See the exceptions below:
Spark
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcStruct
Hive
Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating audit_id (state=,code=0)
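For reference, the writes in steps 1 and 4 were done roughly like this. The sketch below is a minimal PySpark illustration; the bucket names, paths and column names are placeholders, not the real ones.
# Minimal PySpark sketch of the writes in steps 1 and 4.
# Bucket, prefix and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-export").getOrCreate()

df = spark.read.parquet("s3://source-bucket/events/")  # source data, placeholder

(df.write
   .mode("overwrite")
   .partitionBy("date")                       # one folder per date value
   .option("compression", "snappy")           # matches orc.compress=SNAPPY on the table
   .orc("s3://datalake-bucket/events_orc/"))  # step 4 used a second folder like this one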
I checked whether there was a serialization problem and found this:
First table (created through the Athena query), which works everywhere:
Input format org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
Output format org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Serde serialization lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
orc.compress SNAPPY
Second table (created by the Glue crawler):
Input format org.apache.hadoop.mapred.TextInputFormat
Output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
So the crawler-created table cannot be read from Hive or Spark, although it works in Athena. I already changed the configuration, but with no effect in Hive or Spark.
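For context, this is the kind of catalog change involved. The boto3 sketch below inspects the crawler-created entry and forces the ORC formats back in; the database and table names are placeholders, and it is an illustration of the approach rather than the exact commands that were run.
# Sketch: inspect (and, in principle, patch) the crawler-created table
# through the Glue API. Database and table names are placeholders.
import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="datalake", Name="events_orc_2")["Table"]
sd = table["StorageDescriptor"]
print(sd["InputFormat"])                        # TextInputFormat here, not ORC
print(sd["OutputFormat"])                       # HiveIgnoreKeyTextOutputFormat
print(sd["SerdeInfo"]["SerializationLibrary"])  # OrcSerde

# Force the ORC input/output formats back into the storage descriptor.
sd["InputFormat"] = "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"
sd["OutputFormat"] = "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"

glue.update_table(
    DatabaseName="datalake",
    TableInput={
        "Name": table["Name"],
        "StorageDescriptor": sd,
        "PartitionKeys": table.get("PartitionKeys", []),
        "TableType": table.get("TableType", "EXTERNAL_TABLE"),
        "Parameters": table.get("Parameters", {}),
    },
)
Note that each partition keeps its own StorageDescriptor in the Data Catalog, so a table-level change alone may not be enough, which could explain why the change had no effect.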
Has anyone faced this problem?
Well,
A few weeks after I posted this question, AWS fixed the problem. As I showed above, the problem was real and it was a bug in Glue.
It is a new product and still has some issues from time to time, but this one was solved properly. See the properties of the table now:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
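With the ORC formats back in the catalog entry, a quick sanity check from Spark on EMR (with the Glue Data Catalog enabled) looks roughly like this; the database and table names are placeholders.
# Sanity check: read the repaired table through the Glue Data Catalog.
# "datalake.events_orc_2" is a placeholder name.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("orc-read-check")
         .enableHiveSupport()   # EMR wires Hive support to the Glue Data Catalog
         .getOrCreate())

spark.table("datalake.events_orc_2").groupBy("date").count().show()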