I am new to Spark and Scala. We have ad event log files formatted as CSVs and then compressed using PKZIP. I have seen many examples of how to decompress zipped files using Java, but how would I do this using Scala for Spark? Ultimately, we want to extract the data from each incoming file and load it into an HBase destination table. Maybe this can be done with HadoopRDD? After that, we are going to introduce Spark Streaming to watch for these files.
Thanks, Ben
Using the ZipFileInputFormat and its helper ZipFileRecordReader class, I was able to get Spark to open and read the zip file cleanly:

rdd1 = sc.newAPIHadoopFile("/Users/myname/data/compressed/target_file.
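For reference, here is a fuller sketch of that call. It assumes the ZipFileInputFormat from the cotdp-hadoop project (key Text = entry name, value BytesWritable = entry bytes); the path and variable names below are only placeholders.

import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}
import com.cotdp.hadoop.ZipFileInputFormat   // assumed cotdp-hadoop dependency

val sc = new SparkContext(new SparkConf().setAppName("zip-csv-reader"))

// Each record is (name of the file inside the zip, raw bytes of that file)
val zipRdd = sc.newAPIHadoopFile(
  "/Users/myname/data/compressed/target_file.zip",   // placeholder path
  classOf[ZipFileInputFormat],
  classOf[Text],
  classOf[BytesWritable],
  sc.hadoopConfiguration)

// Decode each zipped CSV into lines, then split each line on the comma delimiter
val rows = zipRdd
  .flatMap { case (_, bytes) =>
    new String(bytes.getBytes, 0, bytes.getLength, "UTF-8").split("\n")
  }
  .map(_.split(","))

rows.take(5).foreach(r => println(r.mkString("|")))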
Reading multiple CSV files into an RDD: Spark RDDs don't have a method for reading CSV files directly, so we use the textFile() method to read each CSV like any other text file and then split each record on the comma, pipe, or whatever delimiter the files use.
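A minimal sketch of that approach, assuming comma-delimited files under a placeholder directory (textFile also accepts globs and comma-separated lists of paths):

val lines = sc.textFile("/data/ad-events/*.csv")   // placeholder glob matching many CSVs
// Split every record on the delimiter the files actually use (comma here)
val fields = lines.map(_.split(","))
fields.take(5).foreach(f => println(f.mkString("|")))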
In Spark, provided your files have the correct filename suffix (e.g. .gz for gzipped) and the codec is supported by org.apache.hadoop.io.compress.CompressionCodecFactory, you can simply use
sc.textFile(path)
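For example, a gzipped CSV is decompressed transparently by the Hadoop codec layer, so the call looks exactly like reading an uncompressed file (the path below is a placeholder):

val lines = sc.textFile("/data/ad-events/events.csv.gz")   // placeholder .gz path
val fields = lines.map(_.split(","))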
UPDATE: At the time of writing there is a bug in the Hadoop bzip2 library, which means that trying to read bzip2 files with Spark results in strange exceptions, usually an ArrayIndexOutOfBoundsException.