When is the data transferred when you create an external table in Hive with an S3 location?

Question

When you create an external table in Hive (on Hadoop) with an Amazon S3 source location, when is the data transferred to the local Hadoop HDFS? Is it on:

external table creation
when queries (MR jobs) are run on the external table
never (no data is ever transferred), and MR jobs read the S3 data directly.

What costs are incurred for S3 reads? Is there a single cost for transferring the data to HDFS, or are there no data transfer costs, with read costs incurred only when the MapReduce job created by Hive runs on this external table?

An example external table definition would be:

CREATE EXTERNAL TABLE mydata (key STRING, value INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '='
LOCATION 's3n://mys3bucket/';

Joe K · Accepted Answer

Map tasks will read the data directly from S3. Between the Map and Reduce steps, data will be written to the local filesystem, and between MapReduce jobs (in queries that require multiple jobs), the temporary data will be written to HDFS.

If you are concerned about S3 read costs, it might make sense to create another table that is stored on HDFS, and do a one-time copy from the S3 table to the HDFS table.
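The one-time copy suggested above could look roughly like the following. This is only a sketch: the table name mydata_hdfs is hypothetical, and it assumes the cluster's default Hive warehouse directory is on HDFS, which is worth checking on your setup.

```sql
-- Hypothetical managed (non-external) table. With no LOCATION clause, Hive
-- stores it under its warehouse directory, assumed here to be on HDFS.
CREATE TABLE mydata_hdfs (key STRING, value INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '=';

-- One-time copy: this job reads the data under the S3 location once
-- and writes the rows into the HDFS-backed table.
INSERT OVERWRITE TABLE mydata_hdfs
SELECT key, value FROM mydata;
```

Later queries would then be written against mydata_hdfs instead of mydata, so they read from HDFS and not from S3. The copy does not pick up objects added to the bucket afterwards, so it would need to be rerun if the S3 data changes.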

Tags:

amazon-s3

hadoop

hive

amazon

Matt Alcock