I have a Spark project running on a Cloudera VM. The project loads data from a Parquet file and then processes it. Everything works fine locally, but I need to run this project on a school cluster, and there I get an error while reading the Parquet file at this line of code:
DataFrame schemaRDF = sqlContext.parquetFile("/var/tmp/graphs/sib200.parquet");
I get the following error:
Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=file:/var/tmp/graphs/sib200.parquet/_common_metadata; isDirectory=false; length=413; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
    at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$28.apply(ParquetRelation.scala:750)
Based on what I found online, it seems to be a Parquet version mismatch.
How can I find the Parquet version installed on a machine, so I can check whether both machines have the same version? If you know the exact fix for this error, that would be even better!
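One quick way to check the Parquet version is to look at the jar file names bundled with the Spark install, since the version is encoded in them. A minimal sketch (the `SPARK_HOME` default below is an assumption; adjust it for your install, e.g. the Cloudera parcel directory):

```shell
# List Parquet jars under a Spark install so you can compare versions
# on both machines. Spark 1.x ships an assembly under lib/, Spark 2.x
# ships individual jars under jars/; we check both.
list_parquet_jars() {
  ls "$1"/lib "$1"/jars 2>/dev/null | grep -i parquet
}

# /usr/lib/spark is an assumed default location, not guaranteed.
list_parquet_jars "${SPARK_HOME:-/usr/lib/spark}" || echo "no Parquet jars found at that path"
```

Running this on both the VM and the cluster and comparing the version numbers in the output (e.g. `parquet-hadoop-1.5.0.jar` vs `parquet-hadoop-1.7.0.jar`) will confirm or rule out a version mismatch.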
I got the same problem trying to read a Parquet file from S3. In my case the issue was that the required libraries were not available to all workers in the cluster.
There are 2 ways to fix that:
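As a sketch of fixes commonly used when worker nodes lack the needed jars (the paths, class names, and package coordinates below are illustrative assumptions, not taken from this answer):

```shell
# Option A (assumed paths): ship the jars with the job, so Spark
# distributes them to every executor.
spark-submit \
  --jars /path/to/aws-java-sdk-1.7.4.jar,/path/to/hadoop-aws-2.7.1.jar \
  --class com.example.Main my-app.jar

# Option B (assumed coordinates): let Spark resolve the dependencies
# from Maven Central on each node at submit time.
spark-submit \
  --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 \
  --class com.example.Main my-app.jar
```

With either option the libraries end up on the classpath of every worker, not just the driver, which is what matters for distributed reads.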