In Scala, how does one uncompress the text contained in file.gz
so that it can be processed? I would be happy with either having the contents of the file stored in a variable, or saving it as a local file so that it can be read in by the program after.
Specifically, I am using Scalding to process compressed log data, but Scalding does not define a way to read them in FileSource.scala
.
While a text file in GZip, BZip2, and other supported compression formats can be configured to be automatically decompressed in Apache Spark as long as it has the right file extension, you must perform additional steps to read zip files.
You can unzip GZ files in Linux by adding the -d flag to the Gzip/Gunzip command. All the same flags we used above can be applied. The GZ file will be removed by default after we uncompressed it unless we use the -k flag. Below we will unzip the GZ files we compressed in the same directory.
open() This function opens a gzip-compressed file in binary or text mode and returns a file like object, which may be physical file, a string or byte object.
Here's my version:
import java.io.BufferedReader
import java.io.InputStreamReader
import java.util.zip.GZIPInputStream
import java.io.FileInputStream
class BufferedReaderIterator(reader: BufferedReader) extends Iterator[String] {
override def hasNext() = reader.ready
override def next() = reader.readLine()
}
object GzFileIterator {
def apply(file: java.io.File, encoding: String) = {
new BufferedReaderIterator(
new BufferedReader(
new InputStreamReader(
new GZIPInputStream(
new FileInputStream(file)), encoding)))
}
}
Then do:
val iterator = GzFileIterator(new java.io.File("test.txt.gz"), "UTF-8")
iterator.foreach(println)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With