I have a folder which contains many small .gz files (compressed csv text files). I need to read them in my Spark job, but the thing is I need to do some processing based on info which is in the file name. Therefore, I did not use:
JavaRDD<String> input = sc.textFile(...)
since to my understanding I do not have access to the file name this way. Instead, I used:
JavaPairRDD<String,String> files_and_content = sc.wholeTextFiles(...);
because this way I get a pair of file name and content. However, it seems that the input reader fails to read the text from the gz file and instead reads binary gibberish.
So, I would like to know if I can configure it to read the text somehow, or alternatively access the file name when using sc.textFile(...).
The Spark documentation clearly specifies that you can read gz files automatically: All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
While a text file in GZip, BZip2, and other supported compression formats can be configured to be automatically decompressed in Apache Spark as long as it has the right file extension, you must perform additional steps to read zip files.
Using the ZipFileInputFormat and its helper ZipFileRecordReader class, I was able to get Spark to open and read the zip file: rdd1 = sc.newAPIHadoopFile("/Users/myname/data/compressed/target_file.ZIP", ZipFileInputFormat, ...
You cannot read gzipped files with wholeTextFiles because it uses CombineFileInputFormat, which cannot read gzipped files since they are not splittable (source proving it):
override def createRecordReader(
    split: InputSplit,
    context: TaskAttemptContext): RecordReader[String, String] = {

  new CombineFileRecordReader[String, String](
    split.asInstanceOf[CombineFileSplit],
    context,
    classOf[WholeTextFileRecordReader])
}
You may be able to use newAPIHadoopFile with a WholeFileInputFormat (not built into Hadoop, but available all over the internet) to get this to work correctly.
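As a rough sketch, assuming sc is your JavaSparkContext and assuming a WholeFileInputFormat implementation that emits the file path as a Text key and the raw bytes as a BytesWritable value (the exact key/value types depend on the implementation you pick up), the call might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;

// WholeFileInputFormat is a hypothetical/third-party class here, not part of Hadoop itself
JavaPairRDD<Text, BytesWritable> filesAndBytes = sc.newAPIHadoopFile(
        "/path/to/folder",              // directory containing the .gz files
        WholeFileInputFormat.class,     // assumed to emit (file path, raw file bytes)
        Text.class,
        BytesWritable.class,
        new Configuration());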
UPDATE 1: I don't think WholeFileInputFormat will work as-is, since it just returns the raw bytes of the file; you may have to write your own class, possibly extending WholeFileInputFormat, to make sure it decompresses the bytes.
Another option would be to decompress the bytes yourself using GZIPInputStream.
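For instance, a minimal sketch of decompressing a gzipped byte[] (e.g. the contents of a BytesWritable) into a String, assuming UTF-8 text:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Decompress a gzipped byte array into text
static String gunzip(byte[] compressed) throws IOException {
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed));
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = gzip.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}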
UPDATE 2: If you have access to the directory name, as in the OP's comment below, you can list all the files like this:
Path path = new Path(""); // fill in the directory that holds the .gz files
FileSystem fileSystem = path.getFileSystem(new Configuration()); // just uses the default one
FileStatus[] fileStatuses = fileSystem.listStatus(path);
ArrayList<Path> paths = new ArrayList<>();
for (FileStatus fileStatus : fileStatuses) paths.add(fileStatus.getPath());
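From there, one way to keep the file-name association while still letting Spark handle the gz decompression (a sketch, again assuming sc is your JavaSparkContext) is to read each file with textFile and tag its lines with the path:

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;

JavaPairRDD<String, String> filesAndLines = null;
for (Path p : paths) {
    String name = p.toString();
    // textFile decompresses .gz files transparently based on the extension
    JavaPairRDD<String, String> oneFile =
            sc.textFile(name).mapToPair(line -> new Tuple2<>(name, line));
    filesAndLines = (filesAndLines == null) ? oneFile : filesAndLines.union(oneFile);
}

With many small files you may want to coalesce or repartition the result afterwards.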