I have a text file of 100-200 GB, so I would like to store it in a compressed format (such as zip). Because of its size, however, I need to process it one line at a time. Reading a plain text file line by line is straightforward with io.Source.fromFile(fileName).getLines, but that only works for uncompressed files.
Is there an efficient way to read a compressed file line by line in Scala? I couldn't find any examples; the closest implementation I saw was here, but it loads the whole file into memory. Unlike the usual examples, which work with zip archives, I only need to process a single compressed text file. I would be grateful for any pointers or leads.
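If the file is gzipped, Java's java.util.zip.GZIPInputStream gives you streaming access:
import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source
val lines: Iterator[String] = Source
.fromInputStream(new GZIPInputStream(new FileInputStream("foo.gz")))
.getLines()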
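For a file this large it may also be worth buffering the raw file reads, fixing the character encoding, and making sure the stream gets closed. The following is only a sketch under a few assumptions: Scala 2.13 or later (for scala.util.Using), UTF-8 text, and a hypothetical helper named countLines that stands in for whatever per-line processing you need.

```scala
import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.GZIPInputStream
import scala.io.{Codec, Source}
import scala.util.Using

object GzipLines {
  // Hypothetical helper: counts the lines of a gzipped text file.
  // Lines are pulled lazily from the iterator, so the file is never fully in memory.
  def countLines(path: String): Long = {
    val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(path)))
    // Using.resource closes the Source (and the underlying streams) when the block ends.
    Using.resource(Source.fromInputStream(in)(Codec.UTF8)) { source =>
      source.getLines().foldLeft(0L)((count, _) => count + 1)
    }
  }
}
```

Calling GzipLines.countLines("foo.gz") would then stream through the file once; replacing the foldLeft with a foreach gives you a place for your own per-line logic.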
If it is a zip archive, as your question suggests, things are more complicated. Zip archives are more like folders than individual files: you have to read the table of contents first, and then scan through the entries to find the one you want to read (or read all of them). Something like this:
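As a hedged sketch of the zip case, assuming the archive holds the text file as its first non-directory entry and that the text is UTF-8, java.util.zip.ZipInputStream can be positioned on that entry and then read like any other stream. The object name ZipLines and the helper foreachLineOfFirstEntry are made up for illustration.

```scala
import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.ZipInputStream
import scala.io.{Codec, Source}

object ZipLines {
  // Hypothetical helper: applies `handle` to every line of the first
  // regular (non-directory) entry in the archive, one line at a time.
  def foreachLineOfFirstEntry(zipPath: String)(handle: String => Unit): Unit = {
    val zip = new ZipInputStream(new BufferedInputStream(new FileInputStream(zipPath)))
    try {
      // getNextEntry returns null once there are no more entries.
      val entry = Iterator
        .continually(zip.getNextEntry)
        .takeWhile(_ != null)
        .find(e => !e.isDirectory)
      // Once positioned on an entry, reads from `zip` return only that
      // entry's decompressed bytes and signal end-of-stream at its end.
      if (entry.isDefined)
        Source.fromInputStream(zip)(Codec.UTF8).getLines().foreach(handle)
    } finally zip.close()
  }
}
```

For example, ZipLines.foreachLineOfFirstEntry("foo.zip")(println) would print the file line by line. If you know the entry name in advance, java.util.zip.ZipFile with getEntry and getInputStream is an alternative that looks the entry up directly instead of scanning.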