Scala: Reading a huge zipped text file line by line without loading into memory

I have a text file of 100-200 GB, so I would like to store it in a compressed format (such as zip). However, because of its size, I need to process it one line at a time. Reading a plain text file line by line is straightforward with io.Source.fromFile(fileName).getLines, but that only works for uncompressed files.

Is there an efficient way to read a compressed file line by line in Scala? I couldn't find any examples; the closest implementation I saw was here, but it loads the file into memory. Unlike the usual examples, which work with zip archives, I need to process only a single compressed text file. I would be grateful for any pointers or leads.

Quiescent asked Mar 02 '23 14:03


1 Answer

If the file is gzipped, Java's GZIPInputStream gives you streaming access:

   import java.io.FileInputStream
   import java.util.zip.GZIPInputStream
   import scala.io.Source
   val lines: Iterator[String] = Source
     .fromInputStream(new GZIPInputStream(new FileInputStream("foo.gz")))
     .getLines()
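
For a file this large it may also be worth making sure the stream gets closed and the character encoding is explicit. The following is only a sketch under a few assumptions: Scala 2.13 or later (for scala.util.Using), UTF-8 content, and a hypothetical helper name withGzipLines that is not part of any library.

```scala
import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.GZIPInputStream
import scala.io.{Codec, Source}
import scala.util.Using

object GzipLines {
  // Hypothetical helper: opens a gzip file, passes a lazy line iterator to `f`,
  // and closes the underlying streams once `f` returns (or throws).
  // Assumes the decompressed content is UTF-8 text.
  def withGzipLines[A](path: String)(f: Iterator[String] => A): A =
    Using.resource(
      new GZIPInputStream(new BufferedInputStream(new FileInputStream(path)))
    ) { in =>
      // Lines are decompressed on demand; nothing is read until `f` pulls them.
      f(Source.fromInputStream(in)(Codec.UTF8).getLines())
    }
}
```

A call such as GzipLines.withGzipLines("foo.gz")(_.count(_.nonEmpty)) would then stream through the file once. The iterator has to be fully consumed inside the function passed in, because the stream is closed as soon as that function returns.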

If it is a zip archive, as your question suggests, that's more complicated. Zip archives are more like folders than individual files. You'd have to read the table of contents first, and then scan through the entries to find the one you want to read (or to read all of them). Something like this
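
One possible way to do the entry scan described above is with java.util.zip.ZipInputStream, which walks the archive entry by entry and exposes the current entry's data as a stream. This is a rough sketch rather than a definitive implementation: it assumes Scala 2.13 or later, UTF-8 text, and the helper name withZipEntryLines and its entryName parameter are made up for illustration.

```scala
import java.io.{BufferedInputStream, FileInputStream, FileNotFoundException}
import java.util.zip.ZipInputStream
import scala.io.{Codec, Source}
import scala.util.Using

object ZipLines {
  // Hypothetical helper: streams the lines of one entry of a zip archive.
  // With entryName = None the first non-directory entry is used, which fits
  // the "archive with a single text file" case.
  def withZipEntryLines[A](zipPath: String, entryName: Option[String] = None)(
      f: Iterator[String] => A
  ): A =
    Using.resource(
      new ZipInputStream(new BufferedInputStream(new FileInputStream(zipPath)))
    ) { zin =>
      // Advance through the entries until one matches; getNextEntry returns
      // null once the archive is exhausted.
      var entry = zin.getNextEntry
      while (entry != null &&
             (entry.isDirectory || !entryName.forall(_ == entry.getName))) {
        entry = zin.getNextEntry
      }
      if (entry == null)
        throw new FileNotFoundException(s"No matching entry in $zipPath")
      // Reads from `zin` now return only the current entry's data and signal
      // end of stream at the end of that entry.
      f(Source.fromInputStream(zin)(Codec.UTF8).getLines())
    }
}
```

For example, ZipLines.withZipEntryLines("foo.zip")(_.foreach(println)) would print the first file in the archive line by line, and passing Some("data.txt") as the second argument would select an entry by name. As with any lazy iterator over a managed stream, the lines must be consumed inside the function, before the archive is closed.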

Dima answered Mar 04 '23 03:03