I have an s3 bucket with nearly 100k gzipped JSON files.
These files are named [timestamp].json instead of the more sensible [timestamp].json.gz.
I have other processes that use them, so renaming is not an option and copying them is even less ideal.
I am using spark.read.json([pattern]) to read these files. If I rename a file so that its name contains .gz this works fine, but whilst the extension is just .json the files cannot be read.
Is there any way I can tell spark that these files are gzipped?
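Concretely, the behaviour described above looks roughly like this (bucket and prefix are made-up placeholders; run in a spark-shell where spark is already defined):

// Works: the .gz suffix tells Spark to gunzip the files before parsing.
val ok = spark.read.json("s3a://my-bucket/events/*.json.gz")
// Does not work: the same gzipped bytes, but without .gz Spark reads them
// as plain text and the JSON parser sees compressed binary data.
val broken = spark.read.json("s3a://my-bucket/events/*.json")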
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the read.json() function, which loads data from a directory of JSON files where each line of each file is a JSON object. Note that a file offered as a JSON file in this sense is not a typical JSON document.
The same conversion can be done with SQLContext.read.json() on either an RDD of String or a JSON file; Spark SQL captures the JSON schema automatically for both reading and writing.
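As a small illustration of that line-per-object format and the automatic schema inference (the sample records and field names below are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-lines-example").getOrCreate()
import spark.implicits._

// Each element is one line of a JSON-lines file: a complete JSON object per line.
val lines = Seq(
  """{"timestamp": "2017-01-01T00:00:00Z", "user": "a", "value": 1}""",
  """{"timestamp": "2017-01-01T00:01:00Z", "user": "b", "value": 2}"""
).toDS()

// read.json infers the schema (string, string, long) automatically.
val df = spark.read.json(lines)
df.printSchema()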
SparkSession can read a compressed JSON file directly, just like this:
val json = spark.read.json("/user/the_file_path/the_json_file.log.gz")
json.printSchema()
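That relies on the .gz suffix being present, though. For files that are gzipped but named plain .json, as in the question, one commonly used workaround (a sketch, not part of the answer above; the class name and paths are placeholders) is to register a custom Hadoop compression codec that claims the .json extension, so Spark decompresses those files transparently:

import org.apache.hadoop.io.compress.GzipCodec

// A gzip codec that claims the .json extension, so Hadoop/Spark will
// gunzip files named *.json. It must be compiled into a jar that is on
// both the driver and executor classpaths.
class GzippedJsonCodec extends GzipCodec {
  override def getDefaultExtension: String = ".json"
}

// Register the codec before reading; the built-in codecs stay available.
spark.sparkContext.hadoopConfiguration
  .set("io.compression.codecs", classOf[GzippedJsonCodec].getName)

// Hypothetical bucket/prefix; the .json files are now gunzipped on read.
val json = spark.read.json("s3a://my-bucket/events/*.json")
json.printSchema()

The obvious caveat is that, with this codec registered, every *.json path read through that Hadoop configuration is treated as gzipped.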