I'm trying to create a Spark RDD from several JSON files compressed into a tar archive. For example, I have 3 files
file1.json
file2.json
file3.json
and these are contained in archive.tar.gz.
I want to create a DataFrame from the JSON files. The problem is that Spark is not reading the JSON files correctly: creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.
Is there some way to handle gzipped archives containing multiple files in Spark?
UPDATE
Using the method given in the answer to Read whole text files from a compression in Spark I was able to get things running, but this method does not seem to be suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Since some of the archives I'm dealing with reach sizes of up to 2 GB after compression, I'm wondering if there is some efficient way to deal with the problem.
I'm trying to avoid extracting the archives and then merging the files together, as this would be time-consuming.
A solution is given in Read whole text files from a compression in Spark. Using the code sample provided, I was able to create a DataFrame from the compressed archive like so:
// extractFiles and decode are the helper functions from the linked answer
val jsonRDD = sc.binaryFiles("gzarchive/*")
  .flatMapValues(x => extractFiles(x).toOption) // unpack each tar.gz into its member files
  .mapValues(_.map(decode()))                   // decode each file's bytes into a JSON string

val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
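The snippet above assumes the extractFiles and decode helpers defined in the linked answer. For reference, here is a sketch of roughly what they look like, based on that answer and using Apache Commons Compress (details such as the buffer size are incidental):

import java.nio.charset.{Charset, StandardCharsets}
import scala.util.Try
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.spark.input.PortableDataStream

// Unpack a gzipped tar stream into one byte array per contained file.
def extractFiles(ps: PortableDataStream, n: Int = 1024) = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
  Stream.continually(Option(tar.getNextTarEntry))
    .takeWhile(_.isDefined)          // stop when there are no more entries
    .flatMap(x => x)
    .filter(!_.isDirectory)          // skip directory entries
    .map(_ => {
      Stream.continually {
        val buffer = Array.fill[Byte](n)(-1)
        val i = tar.read(buffer, 0, n)
        (i, buffer.take(i))
      }.takeWhile(_._1 > 0)          // read until the current entry is exhausted
        .flatMap(_._2)
        .toArray
    })
    .toArray
}

// Turn a file's bytes into a String (here, the JSON text).
def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]) =
  new String(bytes, charset)

Because sc.binaryFiles hands each archive to a single task as one unsplittable blob, the whole tar.gz has to fit through one executor, which is why this approach struggles with the larger archives mentioned above.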
This method works fine for tar archives of a relatively small size, but is not suitable for large archive sizes.
A better solution to the problem seems to be to convert the tar archives to Hadoop SequenceFiles, which are splittable and hence can be read and processed in parallel in Spark (unlike tar archives).
See: A Million Little Files – Digital Digressions by Stuart Sierra.
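For illustration, a minimal sketch of what the SequenceFile route could look like; the output path is hypothetical, and the one-off conversion step shown here still reads each archive on a single task, so for very large archives it is better done offline with a dedicated tool as the article describes:

// One-off conversion: write each extracted JSON document as a record of a
// SequenceFile (key = archive path, value = file contents as a string).
jsonRDD
  .flatMapValues(x => x)                       // (archivePath, jsonText) pairs
  .saveAsSequenceFile("json-as-sequencefile")  // hypothetical output path

// From then on, the SequenceFile is splittable and can be read in parallel.
val records = sc.sequenceFile[String, String]("json-as-sequencefile")
val df = sqlContext.read.json(records.map(_._2))

Once the data lives in a SequenceFile, Spark never has to touch the tar archives again, and the read side scales with the number of input splits rather than the number of archives.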