Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are the steps to use Redshift Spectrum.?

Currently I am using Amazon Redshift as well as Amazon S3 to store data. Now I want to use Spectrum to improve performance but confused in how to use it properly.

If I am using SQL workbench can I create external schema from same or I need to create it from AWS console or Athena.?

Do I need to have Athena for a specific region.? Is it possible to use spectrum without Athena.?

Now if I try to create external schema through SQL workbench it was throwing an error "CREATE EXTERNAL SCHEMA is not enabled" How can enable this..?

Please help if someone had used Spectrum and let me know detailed steps to use spectrum.

like image 517
Pratik Rawlekar Avatar asked Jun 20 '17 07:06

Pratik Rawlekar


People also ask

When can I use Amazon Redshift spectrum?

Redshift Spectrum can be used in combination with other AWS computing services that have direct access to S3, such as Amazon Athena, Amazon Elastic Map Reduce for Apache Spark, Apache Hive, and Presto. Amazon Redshift Spectrum is a great tool for easily executing complex SQL queries against data stored in Amazon S3.

What is the difference between Redshift and Redshift spectrum?

Redshift Spectrum is an extension of Amazon Redshift. The service allows data analysts to run queries on data stored in S3. It makes it possible, for instance, to join data in external tables with data stored in Amazon Redshift to run complex queries.

What protocol does Redshift use?

It is protocol-compatible with PostgreSQL and is available through JDBC/ODBC, opening it up to a huge range of existing SQL tools. Redshift uses columnar storage, which means it physically stores the data for each column in contiguous blocks, as opposed to regular transactional databases that store data in rows.

What is S3 spectrum?

Spectrum is a Redshift component that allows you to query files stored in Amazon S3. Your team can narrow its search by querying only the necessary columns for your analysis.


1 Answers

Redshift Spectrum requires an external data catalog that contains the definition of the table. It is this data catalog that contains the reference to the files in S3, rather than the external table definition in Redshift. This data catalog can be defined in Elastic MapReduce as a Hive Catalog (good if you have an existing EMR deployment) or in Athena (good if you don't have EMR or don't want to get into managing Hadoop). The Athena route can be managed fully by Redshift, if you wish.

It looks to me like your issue is one of four things. Either:

  1. Your Redshift cluster is not in an AWS region that currently supports Athena and Spectrum.
  2. Your Redshift cluster version doesn't support Spectrum yet (1.0.1294 or later).
  3. Your IAM policies don't allow Redshift control over Athena.
  4. You're not using the CREATE EXTERNAL DATABASE IF NOT EXISTS parameter on your CREATE EXTERNAL SCHEMA statement.

To allow Redshift to manage Athena you'll need to attach an IAM policy to your Redshift cluster that allows it Full Control over Athena, as well as Read access to the S3 bucket containing your data.

Once that's in place, you can create your external schema as you have been already, ensuring that the CREATE EXTERNAL DATABASE IF NOT EXISTS argument is also passed. This makes sure that the external database is created in Athena if you don't have a pre-existing configuration: http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-external-table.html

Finally, run your CREATE EXTERNAL TABLE statement, which will transparently create the table metadata in the Athena data catalog: http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html

like image 117
GShenanigan Avatar answered Nov 15 '22 04:11

GShenanigan