I can read a JSON file into a DataFrame in PySpark using:
spark = SparkSession.builder.appName('GetDetails').getOrCreate()
df = spark.read.json("path to json file")
However, when I try to read a bz2-compressed CSV file into a DataFrame, I get an error. I am using:
spark = SparkSession.builder.appName('GetDetails').getOrCreate()
df = spark.read.load("path to bz2 file")
Could you please help correct me?
bz2.open(): This function opens a bzip2-compressed file and returns a file object. The file can be opened in binary or text mode, for reading or writing. When writing, the compresslevel argument (an integer from 1 to 9) sets the compression level.
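As a rough illustration of the description above, here is a minimal sketch using Python's standard bz2 module. The file name sample.csv.bz2 and the sample rows are made up for the example; it writes a small compressed CSV in text mode and reads it back.

```python
import bz2
import csv
import os
import tempfile

# Hypothetical sample rows, used only for this illustration.
rows = [["id", "name"], ["1", "alice"], ["2", "bob"]]

# Write to a temporary directory so the example leaves no files behind in the cwd.
path = os.path.join(tempfile.mkdtemp(), "sample.csv.bz2")

# Text mode ("wt") lets csv.writer work with str; compresslevel accepts 1 to 9.
with bz2.open(path, "wt", newline="", compresslevel=9) as f:
    csv.writer(f).writerows(rows)

# Text mode ("rt") decompresses transparently while reading.
with bz2.open(path, "rt", newline="") as f:
    read_back = list(csv.reader(f))

print(read_back)
```

This is plain Python and loads everything on one machine, so it is only useful for a quick look at a small file, not as a replacement for reading the data through Spark.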
How to Open a BZ2 File. BZ2 files can be opened with 7-Zip and other compression/decompression programs. Of them, PeaZip is a good choice because it fully supports the format. This means it can open the file as well as compress one using the BZIP2 compression method to make a BZ2 file.
The method spark.read.load() has an optional parameter format, which by default is 'parquet'.
So, for your code to work it should look like this:
df = spark.read.load("data.json.bz2", format="json")
Also, spark.read.json works directly on compressed JSON files, e.g.:
df = spark.read.json("data.json.bz2")
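Since the question mentions a bz2-compressed CSV rather than JSON, the same idea should carry over with format="csv". Below is a minimal sketch, assuming an already created SparkSession and a hypothetical file named data.csv.bz2; the helper name read_bz2_csv is made up for illustration and is not part of PySpark.

```python
def read_bz2_csv(spark, path, header=True):
    """Read a bzip2-compressed CSV file into a DataFrame.

    `spark` is an existing SparkSession; `path` is a hypothetical
    location such as "data.csv.bz2".
    """
    # Spark chooses the bzip2 codec from the ".bz2" file extension, so only
    # the format has to be stated; without it, load() assumes Parquet.
    return spark.read.load(path, format="csv", header=header)


# Usage, assuming PySpark is installed and the file exists:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName('GetDetails').getOrCreate()
#   df = read_bz2_csv(spark, "data.csv.bz2")
```

The shorthand spark.read.csv("data.csv.bz2", header=True) should behave the same way. Set header=False if the file has no header row.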