 

How to read a ".gz" compressed file using Spark DF or DS?

I have a compressed file in .gz format. Is it possible to read the file directly using Spark DF/DS?

Details: the file is a CSV with tab delimiters.

asked Mar 26 '18 by prady

People also ask

Why can't I read a GZ file in Spark?

Because any unknown extension defaults to plain text. The reason you can't read a file ending in .gz.tmp is that Spark tries to match the file extension against its registered compression codecs, and no codec handles the extension .gz.tmp!
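
If you do need to read files with such a non-standard extension, one known workaround is to register a custom Hadoop codec that claims that extension. A minimal Scala sketch, assuming Spark 2.x on the JVM; the class name TmpGzipCodec is hypothetical:

import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.sql.SparkSession

// Hypothetical codec: plain gzip decompression, but matched to the
// ".gz.tmp" extension instead of the default ".gz".
class TmpGzipCodec extends GzipCodec {
  override def getDefaultExtension: String = ".gz.tmp"
}

// Register the codec so Hadoop's CompressionCodecFactory can find it,
// then read as usual.
val spark = SparkSession.builder()
  .config("spark.hadoop.io.compression.codecs", classOf[TmpGzipCodec].getName)
  .getOrCreate()

val df = spark.read.option("sep", "\t").csv("file.csv.gz.tmp")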

Should I include .z or .gz in the filename?

It doesn't matter whether you include the .z or .gz in the filename. If you don't have enough disk space to uncompress the file, or you only want to see the contents once while keeping the file compressed, you can send the contents of the file to standard output (usually your terminal) using the zcat command.

How to read a .gz file in Linux?

Use gunzip: in a terminal, type gunzip, a space, and the file name, then press Enter. For example, gunzip example.gz unpacks the file to example. To view the contents without decompressing to disk, use zcat example.gz (a JVM equivalent is sketched below).
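
For completeness, the same zcat-style streaming read can be done from code. A minimal Scala sketch using only the standard library; the file name is an illustrative assumption:

import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source

// Stream the decompressed contents to stdout without writing a
// decompressed copy to disk.
Source.fromInputStream(new GZIPInputStream(new FileInputStream("example.csv.gz")))
  .getLines()
  .foreach(println)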



1 Answer

Reading a compressed CSV is done in the same way as reading an uncompressed CSV file. For Spark 2.0+ it can be done as follows in Scala (note the extra option for the tab delimiter):

// Spark picks the gzip codec from the .gz extension; "sep" sets the tab delimiter
val df = spark.read.option("sep", "\t").csv("file.csv.gz")

PySpark:

# The same read in PySpark; the codec is again inferred from the extension
df = spark.read.csv("file.csv.gz", sep='\t')

The only extra consideration is that gzip files are not splittable, so Spark has to read the whole file with a single core, which slows the initial read down. Once the read is done, the data can be shuffled to increase parallelism.
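
For example, a repartition right after the read spreads the rows across the cluster before any expensive transformations run. A short sketch; the partition count is an illustrative assumption to be tuned to your cluster:

// A single gzip file is read into a single partition; redistribute it
// so later stages can run in parallel. 8 is an arbitrary example value.
val parallelDf = df.repartition(8)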

answered Sep 20 '22 by Shaido