i am trying to do select * from db.abc in hive,this hive table was loaded using spark
it does not work shows an error:
Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)
when i use the following properties i was able to query for hive:
set hive.mapred.mode=nonstrict;
set hive.optimize.ppd=true;
set hive.optimize.index.filter=true;
set hive.tez.bucket.pruning=true;
set hive.explain.user=false;
set hive.fetch.task.conversion=none;
now when i try to read the same hive table db.abc using spark , i am recieving the error as below:
Clients can access this table only if they have the following capabilities: CONNECTORREAD,HIVEFULLACIDREAD,HIVEFULLACIDWRITE,HIVEMANAGESTATS,HIVECACHEINVALIDATE,CONNECTORWRITE. This table may be a Hive-managed ACID table, or require some other capability that Spark currently does not implement; at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.throwIfNoAccess(ExternalCatalogUtils.scala:280) at org.apache.spark.sql.hive.HiveTranslationLayerCheck$$anonfun$apply$1.applyOrElse(HiveTranslationLayerStrategies.scala:105) at org.apache.spark.sql.hive.HiveTranslationLayerCheck$$anonfun$apply$1.applyOrElse(HiveTranslationLayerStrategies.scala:85) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.hive.HiveTranslationLayerCheck.apply(HiveTranslationLayerStrategies.scala:85) at org.apache.spark.sql.hive.HiveTranslationLayerCheck.apply(HiveTranslationLayerStrategies.scala:83) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76) at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:124) at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:118) at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:103) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided
do i need to add any properties in spark-submit or shell ? or what is the alternate way to read this hiv e table using spark
hive table sample format:
CREATE TABLE `hive``( |
| `c_id` decimal(11,0),etc.........
ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| WITH SERDEPROPERTIES (
STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
LOCATION |
| path= 'hdfs://gjuyada/bbts/scl/raw' |
| TBLPROPERTIES ( |
| 'bucketing_version'='2', |
| 'spark.sql.create.version'='2.3.2.3.1.0.0-78', |
| 'spark.sql.sources.provider'='orc', |
| 'spark.sql.sources.schema.numParts'='1', |
| 'spark.sql.sources.schema.part.0'='{"type":"struct","fields":
[{"name":"Czz_ID","type":"decimal(11,0)","nullable":true,"metadata":{}},
{"name":"DzzzC_CD","type":"string","nullable":true,"metadata":{}},
{"name":"C0000_S_N","type":"decimal(11,0)","nullable":true,"metadata":{}},
{"name":"P_ _NB","type":"decimal(11,0)","nullable":true,"metadata":{}},
{"name":"C_YYYY","type":"string","nullable":true,"metadata":{}},"type":"string","nullable":true,"metadata":{}},{"name":"Cv_ID","type":"string","nullable":true,"metadata":{}},
| 'transactional'='true', |
| 'transient_lastDdlTime'='1574817059')
The issue you are trying to reading Transactional table
(transactional = true)
into Spark.
Officially
Spark
not yet supported for Hive-ACID table, get afull dump/incremental dump of acid table
to regularhive orc/parquet
partitioned table then read the data using spark.
There is a Open Jira SPARK-15348 to add support for reading Hive ACID
table.
If you run major compaction
on Acid table(from hive) then spark able to read base_XXX
directories only but not delta directories SPARK-16996 addressed in this jira.
There are some workaround to read acid tables using SPARK-LLAP as mentioned in this link.
I think starting from HDP-3.X
HiveWareHouseConnector is able to support to read HiveAcid tables.
You can create an snapshot
of the transactional table as non transactional
and then read the data from the table.
**`create table <non_trans> stored as orc as select * from <transactional_table>`**
UPDATE:
1.Create an external hive table:
CREATE external TABLE `<ext_tab_name>`(
<col_name> <data_type>....etc
)
stored as orc
location '<path>';
2.Then overwrite to the above external table with existing transactional table data.
insert overwrite table <ext_tab_name> select * from <transactional_tab_name>;
You need to create table with TBLPROPERTIES ( 'transactional'='false' )
CREATE TABLE your_table(
`col` string,
`col2` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
TBLPROPERTIES (
'transactional'='false'
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With