How to create managed hive table with specified location through Spark SQL?

I want to create a managed table with a location on AWS S3 through Spark SQL, but if I specify the location it creates an EXTERNAL table even though I didn't specify that keyword.

CREATE TABLE IF NOT EXISTS database.tableOnS3(name string)
LOCATION 's3://mybucket/';

Why is the EXTERNAL keyword implied here?

If I execute this query in the Hive console it creates a managed table, so how do I do the same in Spark?

asked by Boris Mitioglov

2 Answers

One workaround: change the tableType to MANAGED once the external table has been created.

import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

// Placeholders: substitute your own table and database names.
val identifier = TableIdentifier("yourTableName", Some("yourDatabaseName"))
val catalog = spark.sessionState.catalog

// Re-write the catalog entry with its tableType flipped from EXTERNAL to MANAGED.
catalog.alterTable(catalog.getTableMetadata(identifier).copy(tableType = CatalogTableType.MANAGED))
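
If you want to confirm the change took effect, re-reading the metadata afterwards should show the new type; a minimal check, reusing the same placeholder identifier as above:

// Re-read the catalog entry and verify the table is now reported as MANAGED.
val updated = spark.sessionState.catalog.getTableMetadata(identifier)
assert(updated.tableType == CatalogTableType.MANAGED)

Note that once the table is MANAGED, dropping it will also delete the underlying S3 data.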
answered by sasikala d

See the docs: Hive fundamentally knows two different types of tables:

Managed (Internal)
External


Managed tables : A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /user/hive/warehouse/databasename.db/tablename/. The default location can be overridden by the location property during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration.

Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.
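
A minimal sketch of the managed-table lifecycle, assuming a Hive-enabled SparkSession; the database, table, and values here are placeholders:

// No LOCATION clause: the data lives under hive.metastore.warehouse.dir /
// spark.sql.warehouse.dir, and Hive owns its lifecycle.
spark.sql("CREATE TABLE IF NOT EXISTS mydb.managed_example (name STRING)")
spark.sql("INSERT INTO mydb.managed_example VALUES ('alice')")

// Dropping a managed table deletes the metadata and the data files together.
spark.sql("DROP TABLE IF EXISTS mydb.managed_example")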

External tables : An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information.

Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.
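
For comparison, a sketch of an external, partitioned table over files that already exist; the bucket, paths, and partition column are placeholders:

// EXTERNAL: the table only describes files it does not own. Dropping the
// table removes the metadata but leaves the files in place.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS mydb.external_example (name STRING)
  PARTITIONED BY (dt STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 's3://mybucket/external_example/'
""")

// After partition directories are added or changed outside of Hive/Spark,
// refresh the partition metadata:
spark.sql("MSCK REPAIR TABLE mydb.external_example")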

Conclusion:

Since you are specifying an explicit (S3) location, Spark treats the table as EXTERNAL, which is why it shows up like that.

Further, if you want to understand how the code works, see CreateTableLikeCommand: the line val tblType = if (location.isEmpty) CatalogTableType.MANAGED else CatalogTableType.EXTERNAL is where the table type is decided dynamically:

/**
 * A command to create a table with the same definition of the given existing table.
 * In the target table definition, the table comment is always empty but the column comments
 * are identical to the ones defined in the source table.
 *
 * The CatalogTable attributes copied from the source table are storage(inputFormat, outputFormat,
 * serde, compressed, properties), schema, provider, partitionColumnNames, bucketSpec.
 *
 * The syntax of using this command in SQL is:
 * {{{
 *   CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
 *   LIKE [other_db_name.]existing_table_name [locationSpec]
 * }}}
 */
case class CreateTableLikeCommand(
    targetTable: TableIdentifier,
    sourceTable: TableIdentifier,
    location: Option[String],
    ifNotExists: Boolean) extends RunnableCommand {

  override def run(sparkSession: SparkSession): Seq[Row] = {
    val catalog = sparkSession.sessionState.catalog
    val sourceTableDesc = catalog.getTempViewOrPermanentTableMetadata(sourceTable)

    val newProvider = if (sourceTableDesc.tableType == CatalogTableType.VIEW) {
      Some(sparkSession.sessionState.conf.defaultDataSourceName)
    } else {
      sourceTableDesc.provider
    }

    // If the location is specified, we create an external table internally.
    // Otherwise create a managed table.
    val tblType = if (location.isEmpty) CatalogTableType.MANAGED else CatalogTableType.EXTERNAL

    val newTableDesc =
      CatalogTable(
        identifier = targetTable,
        tableType = tblType,
        storage = sourceTableDesc.storage.copy(
          locationUri = location.map(CatalogUtils.stringToURI(_))),
        schema = sourceTableDesc.schema,
        provider = newProvider,
        partitionColumnNames = sourceTableDesc.partitionColumnNames,
        bucketSpec = sourceTableDesc.bucketSpec)

    catalog.createTable(newTableDesc, ifNotExists)
    Seq.empty[Row]
  }
}

Update: "If I execute this query in the Hive console it creates a managed table, so how do I do the same in Spark?"

Hopefully you are using the same location (not a different VPC) where Hive and Spark co-exist. If so, then change

spark.sql.warehouse.dir=hdfs:///... to your S3 location

via the Spark conf. You may also need to set the access key and secret key credentials on the Spark config object when creating the SparkSession, as sketched below.
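
A rough sketch of that setup; the bucket, warehouse path, and the way credentials are read are assumptions, and the fs.s3a.* keys assume the Hadoop S3A connector is on the classpath:

import org.apache.spark.sql.SparkSession

// Point the warehouse at S3 so tables created without an explicit LOCATION
// are stored (and managed) in the bucket, and pass S3A credentials through
// the Hadoop configuration.
val spark = SparkSession.builder()
  .appName("managed-tables-on-s3")
  .config("spark.sql.warehouse.dir", "s3a://mybucket/warehouse/")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .enableHiveSupport()
  .getOrCreate()

// With the warehouse on S3, a plain CREATE TABLE (no LOCATION clause)
// produces a MANAGED table whose data lives in the bucket.
spark.sql("CREATE TABLE IF NOT EXISTS database.tableOnS3 (name string)")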


answered by Ram Ghadiyaram

