Apache Parquet: Could not read footer: java.io.IOException

I have a Spark project running on a Cloudera VM. The project loads data from a Parquet file and then processes it. Everything works fine there, but I also need to run the project on a school cluster, and on that cluster reading the Parquet file fails at this line of code:

DataFrame schemaRDF = sqlContext.parquetFile("/var/tmp/graphs/sib200.parquet");

I get the following error:

Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=file:/var/tmp/graphs/sib200.parquet/_common_metadata; isDirectory=false; length=413; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
    at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$28.apply(ParquetRelation.scala:750)

Based on what I found online, it seems to be a Parquet version problem.

What I would like to know is how to find the Parquet version installed on a machine, so that I can check whether both machines have the same version. If you know the exact solution for this error, that would be even better!
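
One possible way to compare the two machines is to ask the JVM where it loaded the Parquet reader class from. The sketch below is not from the original post: the class name ParquetVersionCheck and the helper describe are made up for illustration, and it assumes you run it with the same classpath as the Spark job. The first candidate class name is taken from the stack trace above; the second is the package name used by later Apache Parquet releases. The version is only printed if the jar manifest records it, but the jar file name in the printed location often includes the version as well.

```java
import java.security.CodeSource;

public class ParquetVersionCheck {
    // First name comes from the stack trace in the question; the second is the
    // package used by later Apache Parquet releases. Only one is expected to resolve.
    static final String[] CANDIDATES = {
        "parquet.hadoop.ParquetFileReader",
        "org.apache.parquet.hadoop.ParquetFileReader"
    };

    /** Reports where a class was loaded from and the version in its jar manifest, if any. */
    static String describe(String className) {
        try {
            // Load without running static initializers; we only need metadata.
            Class<?> cls = Class.forName(className, false,
                    ParquetVersionCheck.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            String location = (src == null || src.getLocation() == null)
                    ? "unknown location"
                    : src.getLocation().toString();
            Package pkg = cls.getPackage();
            String version = (pkg == null) ? null : pkg.getImplementationVersion();
            return className + " -> " + location + ", version "
                    + (version == null ? "not recorded in manifest" : version);
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " -> not on the classpath";
        }
    }

    public static void main(String[] args) {
        for (String name : CANDIDATES) {
            System.out.println(describe(name));
        }
    }
}
```

Running this on both the Cloudera VM and the school cluster and comparing the printed jar locations should show whether the two environments pick up different Parquet jars.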

Lavdërim Shala asked Jan 15 '16 15:01


1 Answer

I ran into the same problem when trying to read a Parquet file from S3. In my case, the issue was that the required libraries were not available to all workers in the cluster.

There are two ways to fix that:

  • Pass the dependencies to the spark-submit command (for example via its --jars option) so that they are distributed to the whole cluster.
  • Add the dependencies to the jars directory under your SPARK_HOME on each worker in the cluster.
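
As a rough illustration of the first option, the sketch below assembles a spark-submit command line that ships extra jars with the application. The --class and --jars options are real spark-submit options (--jars takes a comma-separated list), but everything else is hypothetical: the class SubmitCommandSketch, the helper buildCommand, the main class com.example.GraphJob and all jar paths are placeholders, not values from the question or the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubmitCommandSketch {
    /** Builds spark-submit arguments that ship the given dependency jars with the app. */
    static List<String> buildCommand(String mainClass, String appJar,
                                     List<String> dependencyJars) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit");
        cmd.add("--class");
        cmd.add(mainClass);
        if (!dependencyJars.isEmpty()) {
            // --jars expects one comma-separated value, not one flag per jar.
            cmd.add("--jars");
            cmd.add(String.join(",", dependencyJars));
        }
        // The application jar comes after the options.
        cmd.add(appJar);
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder names and paths, for illustration only.
        List<String> cmd = buildCommand(
                "com.example.GraphJob",
                "graph-job.jar",
                Arrays.asList("/opt/libs/parquet-hadoop.jar", "/opt/libs/parquet-column.jar"));
        System.out.println(String.join(" ", cmd));
    }
}
```

The printed line can be pasted into a shell on the cluster's submit node after replacing the placeholder names with the real main class and jar paths.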

Bruno Faria answered Oct 11 '22 13:10