
How to convert a JSON file to parquet using Apache Spark?

I am new to Apache Spark 1.3.1. How can I convert a JSON file to Parquet?

odbhut.shei.chhele asked Jan 12 '16 10:01

People also ask

Can I convert JSON to Parquet?

Yes, we can convert CSV/JSON files to Parquet using AWS Glue. This is not the only use case: Glue can convert to several other output formats as well.

Which of the following transforms the JSON data to a parquet file?

A DataFrame loaded from JSON can be written out as CSV using dataframe.write.csv("path"), and as Parquet using dataframe.write.parquet("path").

How do I read a JSON file in Spark?

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset&lt;Row&gt;. This conversion can be done using SparkSession.read().json() on either a Dataset&lt;String&gt; or a JSON file.
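To illustrate the schema inference mentioned above, here is a minimal Scala sketch (the file path and app name are placeholders, and it assumes Spark is on the classpath):

```scala
import org.apache.spark.sql.SparkSession

// Build a local session (placeholder app name)
val spark = SparkSession.builder()
  .appName("json-schema-inference")
  .master("local[*]")
  .getOrCreate()

// Spark infers the schema from the JSON records automatically
val df = spark.read.json("path/to/json/file")
df.printSchema() // prints the inferred column names and types
```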


1 Answer

Spark 1.4 and later

You can use Spark SQL to first read the JSON file into a DataFrame, then write the DataFrame out as a Parquet file.

val df = sqlContext.read.json("path/to/json/file")
df.write.parquet("path/to/parquet/file")

or

df.save("path/to/parquet/file", "parquet")

Check here and here for examples and more details.
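Putting the two lines above into a complete Spark 1.4-era program might look like the following sketch (the paths and app name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Set up a local Spark context (placeholder app name)
val conf = new SparkConf().setAppName("json-to-parquet").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Read JSON (schema inferred automatically), then write out as Parquet
val df = sqlContext.read.json("path/to/json/file")
df.write.parquet("path/to/parquet/file")
```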

Spark 1.3.1

val df = sqlContext.jsonFile("path/to/json/file")
df.saveAsParquetFile("path/to/parquet/file")
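For completeness, the 1.3.1 calls in context (a sketch with placeholder paths; jsonFile and saveAsParquetFile were later deprecated in favor of read.json and write.parquet):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setAppName("json-to-parquet-1.3").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

val df = sqlContext.jsonFile("path/to/json/file") // deprecated in 1.4+
df.saveAsParquetFile("path/to/parquet/file")      // deprecated in 1.4+
```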

Issue related to Windows and Spark 1.3.1

Saving a DataFrame as a parquet file on Windows will throw a java.lang.NullPointerException, as described here.

In that case, please consider upgrading to a more recent Spark version.

Rami answered Sep 25 '22 08:09