
Writing Spark DataFrame to Hive table through AWS Glue Data Catalog

I'm using Spark 2.4.0 on EMR and trying to store a simple DataFrame in S3 using the AWS Glue Data Catalog. The code is below:

val peopleTable = spark.sql("select * from emrdb.testtableemr")
val filtered = peopleTable.filter("name = 'Andrzej'")
filtered.repartition(1).write.format("hive").mode("append").saveAsTable("emrdb.testtableemr")

The code above works as expected: the data is filtered and stored in the S3 directory linked to the AWS Glue table emrdb.testtableemr. However, although the write succeeds, it still throws the exception below:

scala> filtered.repartition(1).write.format("hive").mode("append").saveAsTable("emrdb.testtableemr")
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Can not create a Path from an empty string;
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
  at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:843)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadTable(ExternalCatalogWithListener.scala:159)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:259)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
  at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:66)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
  at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:465)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:444)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:400)
  ... 49 elided
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
  at org.apache.hadoop.fs.Path.<init>(Path.java:175)
  at org.apache.hadoop.hive.metastore.Warehouse.getDatabasePath(Warehouse.java:172)
  at org.apache.hadoop.hive.metastore.Warehouse.getTablePath(Warehouse.java:184)
  at org.apache.hadoop.hive.metastore.Warehouse.getFileStatusesForUnpartitionedTable(Warehouse.java:520)
  at org.apache.hadoop.hive.metastore.MetaStoreUtils.updateUnpartitionedTableStatsFast(MetaStoreUtils.java:180)
  at com.amazonaws.glue.shims.AwsGlueSparkHiveShims.updateTableStatsFast(AwsGlueSparkHiveShims.java:62)
  at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:534)
  at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:400)
  at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:497)
  at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:485)
  at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1669)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:878)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:780)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:780)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:780)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
  at org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:779)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:845)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:843)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:843)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
  ... 74 more

I get the same error using the insertInto method:

filtered.repartition(1).write.mode("append").insertInto("emrdb.testtableemr")

Can you please help me understand the meaning of this exception in this context and suggest a way to fix it?

Thanks in advance!

Regards Andrzej

awenclaw asked Jan 30 '19 12:01



2 Answers

The error occurs because the DataFrame writer statement does not specify an S3 path. Passing the S3 path as shown below fixes the issue.

val peopleTable = spark.sql("select * from emrdb.testtableemr")
val filtered = peopleTable.filter("name = 'Andrzej'")
filtered.repartition(1).write.option("path","s3://testbucket/testpath/").mode("append").saveAsTable("emrdb.testtableemr")
Prabhakar Reddy answered Oct 11 '22 14:10



I had a similar issue. My solution was to open the settings of the database in the Glue Data Catalog ('emrdb' in your case) and add a Location URI (it can be an arbitrary one). After that I could create tables without specifying LOCATION, just as I did before using the Glue Data Catalog, and everything works fine.
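
To check whether this applies to you, one option is to inspect the database's Location URI from the same spark-shell session. This is only a minimal sketch, assuming `spark` is the session that spark-shell provides and that the Glue Data Catalog is configured as its metastore; the database name emrdb comes from the question.

```scala
// Sketch for spark-shell on EMR; `spark` is the SparkSession the shell creates.
// Look up the database metadata through Spark's catalog API.
val db = spark.catalog.getDatabase("emrdb")

// locationUri may be null or empty when the Glue database has no Location URI.
val hasLocation = Option(db.locationUri).exists(_.trim.nonEmpty)

println(s"Database ${db.name} location: '${db.locationUri}' (set: $hasLocation)")
```

If this reports that no location is set, the empty path in the stack trace (Warehouse.getDatabasePath) most likely comes from the database definition rather than from the table itself.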

The underlying limitation is described in the official documentation: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html

Having a default database without a location URI causes failures when you create a table. As a workaround, use the LOCATION clause to specify a bucket location, such as s3://mybucket, when you use CREATE TABLE. Alternatively create tables within a database other than the default database.
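
The LOCATION workaround from the quoted documentation could look like the sketch below when run from spark-shell. The table name people_copy, its single name column, the Parquet format, and the bucket path are hypothetical placeholders, not taken from the question.

```scala
// Sketch for spark-shell on EMR with Hive support enabled; `spark` is the shell's SparkSession.
// Hypothetical table, schema, format, and bucket: replace them with your own.
spark.sql(
  """CREATE TABLE IF NOT EXISTS emrdb.people_copy (name STRING)
    |STORED AS PARQUET
    |LOCATION 's3://mybucket/people_copy/'""".stripMargin)
```

With an explicit LOCATION the table path no longer has to be derived from the database's Location URI, which is the lookup that fails in the stack trace above.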

Pierre answered Oct 11 '22 13:10
