I have line data in .gz compressed format. I have to read it in PySpark. Following is the code snippet:
rdd = sc.textFile("data/label.gz").map(func)
But I could not read the above file successfully. How do I read a gz compressed file? I found a similar question here, but my current version of Spark is different from the version in that question. I expect there should be some built-in function, as in Hadoop.
You can load compressed files directly into DataFrames through the SparkSession; Spark infers the compression codec from the file extension in the path:
df = spark.read.csv("filepath/part-000.csv.gz")
You can also optionally specify whether a header is present and supply a schema:
df = spark.read.csv("filepath/part-000.csv.gz", header=True, schema=schema)
The Spark documentation clearly states that gz files can be read automatically:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
I'd suggest running the following command, and see the result:
rdd = sc.textFile("data/label.gz")
print(rdd.take(10))
Assuming that Spark finds the file data/label.gz, it will print the first 10 rows from the file. Note that the default location for a relative path like data/label.gz is the HDFS home directory of the Spark user. Is it there?
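If the file actually lives on the local filesystem rather than HDFS, you can make that explicit with a file:// URI (the path below is illustrative):

rdd = sc.textFile("file:///home/user/data/label.gz")
print(rdd.take(10))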
You didn't include the error message you got, but gzip may still be part of your problem: gzipped files are not splittable, so Spark reads the whole file in a single task. If you need the read to be parallelized, use a splittable compression codec like bzip2.
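A quick way to see this is to check the partition count after loading (a sketch, assuming the same sc and path as above):

rdd = sc.textFile("data/label.gz")
print(rdd.getNumPartitions())  # gzip is not splittable, so this is typically 1
rdd = rdd.repartition(8)       # restore parallelism for downstream stages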