I would like to create a database in Athena via API. I have parquet files in S3 that I would like to query using the API and I would like to use Athena for the query.
Anyway I can create a database via API for Athena?
Today, Amazon Athena announced enhancements to its console and API which provide more flexibility when using parameterized queries. You can now run parameterized queries directly from the Athena console and an enhanced API that no longer requires you to prepare SQL statements in advance.
You can get the CREATE TABLE DDL statement from Athena, by calling StartQueryExecution() from your code, waiting for the query to complete and then downloading the results file or using the GetQueryResults() API. Athena uses the Glue Data Catalog as a Hive metastore.
Athena is not a database but rather a query engine. This means that: Compute and storage are separate: databases both store data in rest, and provision the resources needed in order to perform queries and calculations. Each of these comes with direct and indirect overheads.
Creating a database in Athena can be done by creating your own API request or using the SDK.
Here is a Python example using the SDK:
import boto3
client = boto3.client('athena')
config = {'OutputLocation': 's3://TEST_BUCKET/'}
client.start_query_execution(
QueryString = 'create database TEST_DATABASE',
ResultConfiguration = config
)
There are SDKs available for Java, .NET, Node, PHP, Python, Ruby, Go, and C++. If you want to create your own API requests, I recommend developing a good understanding of the signing process. You could also use the AWS CLI as such:
$ aws athena start-query-execution --query-string "CREATE database ATHENA_TEST_TWO" --result-configuration "OutputLocation=s3://TEST_BUCKET/"
Once you have a database created, you can then pass the database name in your query requests.
context = {'Database': 'TEST_DATABASE'}
client.start_query_execution(QueryString='CREATE TABLE ...',
QueryExecutionContext = context,
ResultConfiguration=config)
To see some DDL creating a table from Parquet files see the following examples on the Amazon Athena User Guide.
Edit In reponse to @condo1234's questions:
But how do I associated a database with a file in S3?
The short answer is you don't. You associate a table with files sharing a prefix in a bucket in S3.
For example, say I want to create a table to analyze data held in s3://TEST_BUCKET
. Through the AWS Console, I can use the poorly named "Create Folder" button to create a prefix called one-table-many-files/
. I then created two csv files:
f1.csv
Codd,1923
Ellison,1944
Chamberlin,1944
Boyce,1947
f2.csv
Hopper,1906
Floyd,1953
Moriarty Wolf Chambers,1980
I then uploaded these text files to the example bucket/prefix combination s3://TEST_BUCKET/one-table-many-files/
I ran the following DDL:
CREATE EXTERNAL TABLE php_test.computer_scientists (
name string,
year_born int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://TEST_BUCKET/one-table-many-files/';
And ran the following SQL Statement:
SELECT * FROM php_test.computer_scientists;
I got the following results back, with data from both files in the bucket + prefix combination specified in the DDL.
"name","year_born"
"Hopper","1906"
"Floyd","1953"
"Moriarty Wolf Chambers","1980"
"Codd","1923"
"Ellison","1944"
"Chamberlin","1944"
"Boyce","1947"
Notice that I am using the word "prefix" and not "folder"? That is because S3 has no concept of a folder! These prefixes are useful however, as they allow for Athena Partitioning.
Per your request, here is a php example as well.
<?php
print('Welcome to PHP');
require 'aws-autoloader.php';
$athena = new Aws\Athena\AthenaClient(['version' => 'latest', 'region' => 'us-east-1' ]);
$athena->StartQueryExecution([
'QueryString' => 'CREATE DATABASE php_test;',
'ResultConfiguration' => [
'OutputLocation' => 's3://TEST_BUCKET/', // REQUIRED
],
]);
?>
See the PHP SDK Documentation for more.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With