
parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file

I have saved a remote DB table in Hive using the saveAsTable method. Now when I try to access the Hive table data with the CLI command select * from table_name, it gives me the error below:

2016-06-15 10:49:36,866 WARN  [HiveServer2-Handler-Pool: Thread-96]:
thrift.ThriftCLIService (ThriftCLIService.java:FetchResults(681)) -
Error fetching results: org.apache.hive.service.cli.HiveSQLException:
java.io.IOException: parquet.io.ParquetDecodingException: Can not read
value at 0 in block -1 in file hdfs:

Any idea what I might be doing wrong here?

asked Jun 15 '16 by serverliving.com


3 Answers

Problem: facing the issue below while querying the data in impyla (data written by a Spark job):

ERROR: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1521667682013_4868_1_00, diagnostics=[Task failed, taskId=task_1521667682013_4868_1_00_000082, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://shastina/sys/datalake_dev/venmo/data/managed_zone/integration/ACCOUNT_20180305/part-r-00082-bc0c080c-4080-4f6b-9b94-f5bafb5234db.snappy.parquet
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Root Cause:

This issue is caused by the different Parquet conventions used by Hive and Spark. In Hive, the decimal datatype is represented as fixed bytes (INT 32). In Spark 1.4 and later, the default convention is the standard Parquet representation for the decimal data type, in which the underlying physical type changes based on the precision of the column.
For example, DECIMAL can be used to annotate the following types:

    int32: for 1 <= precision <= 9
    int64: for 1 <= precision <= 18; precision < 10 will produce a warning

Hence this issue occurs only with datatypes that have different representations in the two Parquet conventions. If the datatype is DECIMAL(10,3), both conventions represent it as INT32, so we won't face an issue. If you are not aware of the internal representation of your datatypes, it is safest to read with the same convention that was used for writing. With Hive you do not have the flexibility to choose the Parquet convention, but with Spark you do.

Solution: The convention Spark uses to write Parquet data is configurable via the property spark.sql.parquet.writeLegacyFormat, whose default value is false. If it is set to true, Spark writes Parquet data with the same convention as Hive, which resolves the issue.

--conf "spark.sql.parquet.writeLegacyFormat=true"
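
The same setting can also be applied on the SparkSession before rewriting the data. This is a minimal PySpark sketch rather than the exact job above; the app name, source table, and output table are hypothetical placeholders:

from pyspark.sql import SparkSession

# Build a session that writes decimals in the Hive-compatible (legacy) layout.
spark = (SparkSession.builder
         .appName("write-legacy-parquet")
         .config("spark.sql.parquet.writeLegacyFormat", "true")
         .getOrCreate())

# Rewrite the table so that Hive/Impala can decode the new Parquet files.
df = spark.table("source_table")          # hypothetical source table
df.write.mode("overwrite").saveAsTable("table_name")

Note that the setting only affects files written after it is set; Parquet files already written in the standard layout have to be rewritten before Hive can read them.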


answered Nov 05 '22 by Amit Kulkarni


I had a similar error (but at a positive index in a non-negative block), and it came from the fact that I had created the Parquet data with some Spark dataframe types marked as non-nullable when the data actually contained nulls.

In my case, I thus interpret the error as Spark attempting to read data from a certain non-nullable type and stumbling across an unexpected null value.

To add to the confusion, after reading the Parquet file, Spark reports with printSchema() that all the fields are nullable, whether they are or not. However, in my case, making them really nullable in the original Parquet file solved the problem.
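
As a sketch of what "making them really nullable" can look like in PySpark (the column names and output path here are made up for illustration), the key is declaring nullable=True in the schema you write with:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Declare the columns as nullable so that a stray None can be encoded correctly.
schema = StructType([
    StructField("id", IntegerType(), nullable=True),
    StructField("name", StringType(), nullable=True),
])
df = spark.createDataFrame([(1, "a"), (2, None)], schema=schema)
df.write.mode("overwrite").parquet("/tmp/nullable_example")    # hypothetical path
spark.read.parquet("/tmp/nullable_example").printSchema()      # fields report as nullable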

Now, the fact that the error happens at "0 in block -1" is suspicious: it almost looks as if the data was not found, since block -1 suggests Spark never even started reading anything (just a guess).

answered Nov 05 '22 by Eric O Lebigot


It looks like a schema mismatch problem here. If you declare your schema as not nullable and create your dataframe with a None value, Spark throws ValueError: This field is not nullable, but got None.

[Pyspark]

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType

schema = ArrayType(StructType([StructField('A', IntegerType(), nullable=False)]))
# This line throws "ValueError: This field is not nullable, but got None".
df = spark.createDataFrame([[[None]], [[2]]], schema=schema)
df.show()

But that is not the case if you use a udf.

Using the same schema, a udf transformation will not throw a ValueError even if the udf returns None. This is where the data/schema mismatch happens.

For example:

df = spark.createDataFrame([[[1]], [[2]]], schema=schema)

def throw_none():
    def _throw_none(x):
        if x[0][0] == 1:
            return [['I AM ONE']]
        else:
            return x
    return udf(_throw_none, schema)

# Since field A only accepts IntegerType, the string "I AM ONE" is silently
# written as null in the first row, but Spark does not raise a ValueError
# this time. This is where the data/schema type mismatch happens.
df = df.select(throw_none()(col("value")).name('value'))
df.show()


Then the following Parquet write and read will throw the parquet.io.ParquetDecodingException error.

df.write.parquet("tmp")
spark.read.parquet("tmp").collect()

So be very careful with null values if you are using a udf: return the correct data type from your udf, and unless it is really necessary, don't set nullable=False in your StructField. Setting nullable=True will avoid the problem.
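
As an illustrative sketch (not part of the original answer), the same pipeline round-trips through Parquet cleanly once the field is nullable and the udf returns the declared type; the output path is a made-up placeholder:

# Same pipeline, but with nullable=True and a udf that returns an integer,
# matching the declared IntegerType, so the write and read both succeed.
safe_schema = ArrayType(StructType([StructField('A', IntegerType(), nullable=True)]))

def mark_one():
    def _mark_one(x):
        return [[-1]] if x[0][0] == 1 else x   # integer sentinel instead of a string
    return udf(_mark_one, safe_schema)

df = spark.createDataFrame([[[1]], [[2]]], schema=safe_schema)
df = df.select(mark_one()(col("value")).name('value'))
df.write.parquet("tmp_fixed")                  # hypothetical path
spark.read.parquet("tmp_fixed").collect()      # no ParquetDecodingException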

answered Nov 05 '22 by Wong Tat Yau