I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3 with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile.

After uploading the data to S3, I want to investigate it using Athena. I would also like to visualize it in QuickSight by connecting to Athena as a data source.

The problem is that after each run of my Spark batch, the newly generated data stored in S3 is not discovered by Athena unless I manually run the query MSCK REPAIR TABLE.

Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?
Use the MSCK REPAIR TABLE command to update the metadata in the catalog after you add Hive compatible partitions. The MSCK REPAIR TABLE command scans a file system such as Amazon S3 for Hive compatible partitions that were added to the file system after the table was created.
MSCK REPAIR TABLE can be a costly operation, because it needs to scan the table's sub-tree in the file system (the S3 bucket). Multiple levels of partitioning can make it more costly, as it needs to traverse additional sub-directories.
AWS gives us a few ways to refresh the Athena table partitions: we can add them through the console, run the MSCK REPAIR TABLE statement in Athena, or use a Glue Crawler. A scheduled crawler can keep the partitions up to date, and if a crawler already exists it can simply be reused.
MSCK REPAIR TABLE recovers all the partitions in the directory of a table and updates the Hive metastore. When a table is created with a PARTITIONED BY clause, only the partition keys are defined; the partitions themselves still have to be registered in the metastore, which is what MSCK REPAIR TABLE (or ALTER TABLE ADD PARTITION) does after new directories appear in S3.
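For the layout in the question, that means the table has to be declared with year/month/date partition keys before MSCK REPAIR TABLE can pick anything up. Here is a minimal sketch of the DDL and the repair call submitted through boto3, assuming placeholder database, table, bucket, column, and file-format names (adjust all of them to your data):

import boto3

athena = boto3.client('athena')
results = {'OutputLocation': 's3://SOMEPLACE/'}  # Athena needs somewhere to write query results

# Partition keys mirror the DATA/YEAR=?/MONTH=?/DATE=?/ layout from the question.
# `date` is a reserved word in Athena DDL, hence the backticks.
# Column names/types and the Parquet format are assumptions -- replace with your schema.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS some_database.some_table (
  col1 string,
  col2 bigint
)
PARTITIONED BY (`year` string, `month` string, `date` string)
STORED AS PARQUET
LOCATION 's3://some_bucket/DATA/'
"""
athena.start_query_execution(QueryString=ddl, ResultConfiguration=results)

# After each hourly run adds a new directory, this rescans the prefix and registers it.
athena.start_query_execution(
    QueryString='MSCK REPAIR TABLE some_database.some_table',
    ResultConfiguration=results,
)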
There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or AWS Data Pipeline?
From any of these, you should be able to fire off the following CLI command.
$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
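Note that start-query-execution only submits the query and returns immediately with a QueryExecutionId; if the next step of your pipeline (for example a QuickSight refresh) depends on the new partitions being visible, you may want to wait for the repair to finish first. A rough sketch with boto3 -- the polling loop is my own addition, not part of the original command:

import time
import boto3

athena = boto3.client('athena')

# Submit the repair and keep the execution id so we can poll it.
execution = athena.start_query_execution(
    QueryString='MSCK REPAIR TABLE some_database.some_table',
    ResultConfiguration={'OutputLocation': 's3://SOMEPLACE/'},
)
query_id = execution['QueryExecutionId']

# Poll until Athena reports a terminal state for the query.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(2)

print('MSCK REPAIR TABLE finished with state:', state)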
Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.

An example Lambda function could be written as follows:
import boto3

def lambda_handler(event, context):
    bucket_name = 'some_bucket'
    client = boto3.client('athena')

    # Where Athena writes its query results
    config = {
        'OutputLocation': 's3://' + bucket_name + '/',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    }

    # Query execution parameters
    sql = 'MSCK REPAIR TABLE some_database.some_table'
    query_context = {'Database': 'some_database'}  # renamed so it does not shadow the Lambda context argument

    client.start_query_execution(QueryString=sql,
                                 QueryExecutionContext=query_context,
                                 ResultConfiguration=config)
You would then configure a trigger to execute your Lambda function when new data are added under the DATA/ prefix in your bucket.
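The trigger itself is just an S3 event notification scoped to that prefix. As a sketch, assuming a hypothetical function ARN and the placeholder bucket/prefix names used above (and noting that this call replaces the bucket's existing notification configuration), it could be wired up like this:

import boto3

s3 = boto3.client('s3')

# Invoke the Lambda above whenever an object is created under DATA/ in the bucket.
# The function must already allow s3.amazonaws.com to invoke it (lambda add-permission).
s3.put_bucket_notification_configuration(
    Bucket='some_bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'Id': 'repair-athena-partitions',
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:repair_athena_table',
                'Events': ['s3:ObjectCreated:*'],
                'Filter': {
                    'Key': {
                        'FilterRules': [
                            {'Name': 'prefix', 'Value': 'DATA/'}
                        ]
                    }
                }
            }
        ]
    }
)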
Ultimately, explicitly rebuilding the partitions after you run your Spark job using a job scheduler has the advantage of being self-documenting. On the other hand, AWS Lambda is convenient for jobs like this one.
You should be running ADD PARTITION instead:

aws athena start-query-execution --query-string "ALTER TABLE some_database.some_table ADD PARTITION..."

This adds the newly created partition from your S3 location. Athena leverages Hive for partitioning data. To create a table with partitions, you must define them in the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition the data.
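Spelled out for the layout in the question, each hourly Spark run could register just the partition it wrote instead of rescanning the whole prefix. A hedged sketch via boto3 -- the table, database, bucket, and partition values are placeholders the job would fill in for the run it just completed:

import boto3

athena = boto3.client('athena')

# Values the Spark job knows for the run it just finished (placeholders here).
year, month, date = '2019', '01', '02'

# Register exactly one partition, pointing at the directory that run wrote.
# The keys are quoted with backticks because `date` is a reserved word in Athena DDL.
sql = (
    "ALTER TABLE some_database.some_table ADD IF NOT EXISTS "
    f"PARTITION (`year`='{year}', `month`='{month}', `date`='{date}') "
    f"LOCATION 's3://some_bucket/DATA/YEAR={year}/MONTH={month}/DATE={date}/'"
)

athena.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={'Database': 'some_database'},
    ResultConfiguration={'OutputLocation': 's3://SOMEPLACE/'},
)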