I have an s3 bucket with nearly 100k gzipped JSON files.
These files are named [timestamp].json instead of the more sensible [timestamp].json.gz.
I have other processes that use them, so renaming is not an option and copying them is even less ideal.
I am using spark.read.json([pattern]) to read these files. If I rename a file so that its name contains .gz this works fine, but whilst the extension is just .json the files cannot be read.
Is there any way I can tell spark that these files are gzipped?
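Concretely, the behaviour described above looks roughly like this (bucket and prefix are made-up placeholders; run in a spark-shell where spark is already defined):

// Works: the .gz suffix tells Spark to gunzip the files before parsing.
val ok = spark.read.json("s3a://my-bucket/events/*.json.gz")
// Does not work: the same gzipped bytes, but without .gz Spark reads them
// as plain text and the JSON parser sees compressed binary data.
val broken = spark.read.json("s3a://my-bucket/events/*.json")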
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the read.json() function, which loads data from a directory of JSON files where each line of each file is a JSON object. Note that a file offered as a JSON file in this sense is not a typical JSON document.
The same conversion can be done with SQLContext.read.json() on either an RDD of String or a JSON file; Spark SQL captures the JSON schema automatically for both reading and writing.
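As a small illustration of that line-per-object format and the automatic schema inference (the sample records and field names below are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-lines-example").getOrCreate()
import spark.implicits._

// Each element is one line of a JSON-lines file: a complete JSON object per line.
val lines = Seq(
  """{"timestamp": "2017-01-01T00:00:00Z", "user": "a", "value": 1}""",
  """{"timestamp": "2017-01-01T00:01:00Z", "user": "b", "value": 2}"""
).toDS()

// read.json infers the schema (string, string, long) automatically.
val df = spark.read.json(lines)
df.printSchema()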
SparkSession can read a compressed JSON file directly, just like this:
val json = spark.read.json("/user/the_file_path/the_json_file.log.gz")
json.printSchema()
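That relies on the .gz suffix being present, though. For files that are gzipped but named plain .json, as in the question, one commonly used workaround (a sketch, not part of the answer above; the class name and paths are placeholders) is to register a custom Hadoop compression codec that claims the .json extension, so Spark decompresses those files transparently:

import org.apache.hadoop.io.compress.GzipCodec

// A gzip codec that claims the .json extension, so Hadoop/Spark will
// gunzip files named *.json. It must be compiled into a jar that is on
// both the driver and executor classpaths.
class GzippedJsonCodec extends GzipCodec {
  override def getDefaultExtension: String = ".json"
}

// Register the codec before reading; the built-in codecs stay available.
spark.sparkContext.hadoopConfiguration
  .set("io.compression.codecs", classOf[GzippedJsonCodec].getName)

// Hypothetical bucket/prefix; the .json files are now gunzipped on read.
val json = spark.read.json("s3a://my-bucket/events/*.json")
json.printSchema()

The obvious caveat is that, with this codec registered, every *.json path read through that Hadoop configuration is treated as gzipped.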