I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset?
Details: the file is a tab-delimited CSV.
Any unknown extension is treated as plain text by default. The reason you can't read a file ending in .gz.tmp is that Spark tries to match the file extension against its registered compression codecs, and no codec handles the extension .gz.tmp.
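As a sketch of how that codec lookup can be worked around (the class name TmpGzipCodec and the .gz.tmp file name are assumptions, not anything from the original answer), one known approach is to subclass the gzip codec so it claims the unusual extension and register it through Hadoop's io.compression.codecs property:

import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.sql.SparkSession

// Hypothetical codec: reuses gzip (de)compression but reports ".gz.tmp"
// as its extension, so Hadoop's codec factory can match such files.
class TmpGzipCodec extends GzipCodec {
  override def getDefaultExtension: String = ".gz.tmp"
}

object ReadTmpGz {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-gz-tmp").getOrCreate()
    // Register the codec with Hadoop so extension matching succeeds.
    spark.sparkContext.hadoopConfiguration
      .set("io.compression.codecs", classOf[TmpGzipCodec].getName)
    val df = spark.read.option("sep", "\t").csv("file.csv.gz.tmp")
    df.show()
  }
}

Note that the codec class has to be on the executor classpath (e.g. packaged in your application jar) for the executors to decompress the file.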
When working with the file on Linux, it doesn't matter whether you include the .z or .gz extension in the filename. If you don't have enough disk space to uncompress the file, or you only want to view the contents once and keep the file compressed, you can send the contents of the file to standard output (usually your terminal) by using the zcat command.
How do you read a .gz file in Linux? In a terminal window, type gunzip followed by a space and the name of the .gz file, then press Enter. For example, gunzip example.gz unpacks the file named example.
Reading a compressed CSV is done in the same way as reading an uncompressed CSV file. For Spark 2.0+ it can be done as follows in Scala (note the extra option for the tab delimiter):
val df = spark.read.option("sep", "\t").csv("file.csv.gz")
PySpark:
df = spark.read.csv("file.csv.gz", sep='\t')
The only extra consideration is that a gz file is not splittable, so Spark has to read the whole file on a single core, which slows things down. After the read is done, the data can be shuffled to increase parallelism, as in the sketch below.
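For example (a minimal sketch; the partition count of 8 and the file name are assumptions), repartitioning right after the read shuffles the data so that later stages run across multiple cores:

// The read itself happens in a single task because gzip is not splittable.
val df = spark.read.option("sep", "\t").csv("file.csv.gz")
// Shuffle into 8 partitions so subsequent transformations run in parallel.
val parallel = df.repartition(8)
parallel.show()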