In Java, I'd wrap a GZIPInputStream over a FileInputStream and be done. How is the equivalent done in Scala?
Source.fromFile("a.csv.gz")....
fromFile returns a BufferedSource, which really wants to view the world as a collection of lines.
Is there no more elegant way than this?
Source.fromInputStream(new GZIPInputStream(new BufferedInputStream(new FileInputStream("a.csv.gz"))))
Spark supports text files, SequenceFiles, and any other Hadoop InputFormat. In Spark, support for gzip input files should work the same as it does in Hadoop.
The Spark documentation clearly states that you can read .gz files automatically: "All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz")."
GZIPInputStream(InputStream in): creates a new input stream with a default buffer size.
GZIPInputStream(InputStream in, int size): creates a new input stream with the specified buffer size.
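To see the two constructors side by side, here is a minimal self-contained sketch (the object name, payload string, and 8192-byte buffer are invented for the demo) that gzips some bytes in memory and decompresses them with each constructor:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object GzipBufferDemo {
  // Compress a fixed payload, then decompress it with either the
  // default-buffer or the explicit-buffer GZIPInputStream constructor.
  def roundTrip(bufferSize: Option[Int]): Array[Byte] = {
    val raw = "hello,gzip".getBytes("UTF-8")

    // Compress in memory.
    val buf = new ByteArrayOutputStream()
    val gz  = new GZIPOutputStream(buf)
    gz.write(raw)
    gz.close()

    val in = new ByteArrayInputStream(buf.toByteArray)
    val gzIn = bufferSize match {
      case Some(size) => new GZIPInputStream(in, size) // explicit buffer size
      case None       => new GZIPInputStream(in)       // default buffer size (512 bytes)
    }
    gzIn.readAllBytes()
  }

  def main(args: Array[String]): Unit = {
    assert(roundTrip(None).sameElements("hello,gzip".getBytes("UTF-8")))
    assert(roundTrip(Some(8192)).sameElements("hello,gzip".getBytes("UTF-8")))
    println("both constructors decompress identically")
  }
}
```

The buffer size only affects how many compressed bytes are read from the underlying stream at a time; both constructors produce identical decompressed output.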
If you want to use Source and not do everything the Java way, then yes, you'll have to add one more layer of wrapping to what you were doing in Java. Source takes InputStreams but can give you Readers, which prevents you from using Source twice.
Scala is pretty good at making you never have to do more work than in Java, but especially with I/O, you often have to fall back to Java classes. (You can always define your own shortcuts, of course:
def gis(s: String) = new GZIPInputStream(new BufferedInputStream(new FileInputStream(s)))
is barely longer than what you've typed already, and now you can reuse it.)
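Putting the shortcut to work, here is a minimal self-contained sketch (the object name and the temp-file contents are invented for the demo) that writes a small gzipped CSV and reads it back through Source:

```scala
import java.io.{BufferedInputStream, File, FileInputStream, FileOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.io.Source

object GzipSourceDemo {
  // The shortcut from the answer above.
  def gis(s: String) =
    new GZIPInputStream(new BufferedInputStream(new FileInputStream(s)))

  // Wrap the GZIPInputStream in a Source and read lines the Scala way.
  def readGzLines(path: String): List[String] =
    Source.fromInputStream(gis(path)).getLines().toList

  // Create a small gzipped CSV so the demo is self-contained,
  // then read it back.
  def demo(): List[String] = {
    val f = File.createTempFile("a", ".csv.gz")
    val out = new GZIPOutputStream(new FileOutputStream(f))
    out.write("a,b\n1,2\n".getBytes("UTF-8"))
    out.close()
    try readGzLines(f.getPath)
    finally f.delete()
  }

  def main(args: Array[String]): Unit = {
    assert(demo() == List("a,b", "1,2"))
    println("read gzipped CSV through Source")
  }
}
```

Once `gis` exists, reading a compressed file is one line: `Source.fromInputStream(gis("a.csv.gz")).getLines()`.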