Can I tell spark.read.json that my files are gzipped?

I have an s3 bucket with nearly 100k gzipped JSON files.

These files are called [timestamp].json instead of the more sensible [timestamp].json.gz.

I have other processes that use them, so renaming is not an option, and copying them is even less ideal.

I am using spark.read.json([pattern]) to read these files. If I rename a file so that its name ends in .gz, this works fine, but while the extension is just .json the files cannot be read.

Is there any way I can tell spark that these files are gzipped?

asked Sep 10 '18 by Hans


People also ask

Can Spark read JSON files?

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the read.json() function, which loads data from a directory of JSON files where each line of the files is a JSON object. Note that a file offered as a JSON file is not a typical JSON file.
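For instance, a minimal sketch of that inference (the path and fields below are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Each input line must be a complete JSON object,
// e.g. {"ts": 1536562800, "event": "click"}
val df = spark.read.json("s3a://my-bucket/events/*.json")
df.printSchema()  // the schema is inferred by sampling the data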

What is the correct code to read an employee.json file in Spark?

This conversion can be done using SQLContext.read.json() on either an RDD of Strings or a JSON file. Spark SQL provides an option for querying JSON data along with auto-capturing of JSON schemas for both reading and writing data.
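As a sketch, in Spark 2.2+ the same conversion also accepts a Dataset[String]; the records below are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// JSON records held in memory rather than in files
val records = Seq("""{"id": 1, "name": "a"}""", """{"id": 2, "name": "b"}""")
val df = spark.read.json(spark.createDataset(records))  // schema auto-captured
df.show()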


1 Answer

SparkSession can read a compressed JSON file directly, like this:

val json = spark.read.json("/user/the_file_path/the_json_file.log.gz")
json.printSchema()
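Spark picks the decompression codec from the file extension, which is exactly what the question's files lack. A workaround sketch, assuming the files are line-delimited JSON (the bucket path is illustrative): read the raw bytes with binaryFiles, gunzip each file by hand, and pass the decompressed lines to spark.read.json.

import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.GZIPInputStream

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Read each object as raw bytes; the path half of each (path, stream) pair is ignored.
val lines = spark.sparkContext
  .binaryFiles("s3a://my-bucket/data/*.json")
  .flatMap { case (_, stream) =>
    // Gunzip manually, since the .json extension defeats codec detection
    val reader = new BufferedReader(
      new InputStreamReader(new GZIPInputStream(stream.open())))
    Iterator.continually(reader.readLine()).takeWhile(_ != null)
  }

// Parse the decompressed lines as JSON, with schema inference as usual
val df = spark.read.json(lines.toDS())
df.printSchema()

Because binaryFiles skips extension-based codec detection entirely, the misleading .json name no longer matters; the trade-off is that each object is read whole rather than through Spark's built-in decompression path.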

answered Sep 19 '22 by xuehui