How to read Parquet file using Spark Core API?
I know Spark SQL has methods to read Parquet files, but we cannot use Spark SQL in our project.
Do we have to use the newAPIHadoopFile method on JavaSparkContext to do this?
I am using Java to implement the Spark job.
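For reference, the newAPIHadoopFile route asked about above looks roughly like the sketch below. This is only an illustration, not a recommendation: it assumes the parquet-hadoop example classes (ParquetInputFormat, GroupReadSupport, Group) are on the classpath, and the input path is a placeholder.

import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParquetCoreRead {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("ParquetCoreRead"));

        // Tell ParquetInputFormat to materialize each record as a generic Group.
        Job job = Job.getInstance();
        ParquetInputFormat.setReadSupportClass(job, GroupReadSupport.class);

        // Keys emitted by ParquetInputFormat are always null (Void); the Group holds the row data.
        JavaPairRDD<Void, Group> parquet = sc.newAPIHadoopFile(
                "hdfs:///path/to/file.parquet",   // placeholder path
                ParquetInputFormat.class,
                Void.class,
                Group.class,
                job.getConfiguration());

        // Print a few rows; Group#toString renders the record's fields.
        for (String row : parquet.values().map(Group::toString).take(10)) {
            System.out.println(row);
        }
        sc.stop();
    }
}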
Spark Read Parquet file into DataFrame: similar to write, DataFrameReader provides a parquet() function (spark.read.parquet) to read Parquet files and create a Spark DataFrame. In this example snippet, we read data from an Apache Parquet file written earlier:
val parqDF = spark.read.parquet("/tmp/output/people.parquet")
Spark SQL provides support for both reading and writing Parquet files and automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to nullable for compatibility reasons.
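For the write side, a minimal Java sketch (here people is a placeholder Dataset<Row>, and the output path is the one used in the snippet above):

// The schema travels with the Parquet files; every column is written as nullable.
people.write().parquet("/tmp/output/people.parquet");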
When spark.sql.parquet.mergeSchema is true, the Parquet data source merges schemas collected from all data files; otherwise the schema is picked from the summary file, or from a random data file if no summary file is available. When spark.sql.parquet.writeLegacyFormat is true, data is written in the format used by Spark 1.4 and earlier.
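In Java these settings can be applied either per read through the data source option, or on the session config; a sketch (the SparkSession spark and the path are assumed from the surrounding snippets):

// Merge schemas for this one read only.
Dataset<Row> merged = spark.read().option("mergeSchema", "true").parquet("/tmp/output/people.parquet");

// Or set the session-level configs described above.
spark.conf().set("spark.sql.parquet.mergeSchema", "true");
spark.conf().set("spark.sql.parquet.writeLegacyFormat", "true");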
Once you have created a Parquet file, you can read its content with spark.read.parquet(); the same query can also be run from a Synapse Studio notebook. Apache Spark additionally lets you access Parquet files through the table API: you can create an external table over a set of Parquet files, as sketched below.
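A minimal sketch of that table-based access (the table name people is a placeholder, and the path is the one used above):

// Register an external table over the Parquet files, then query it with SQL.
spark.sql("CREATE TABLE people USING parquet LOCATION '/tmp/output/people.parquet'");
Dataset<Row> fromTable = spark.sql("SELECT * FROM people");
fromTable.show();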
Use the below code:
SparkSession spark = SparkSession.builder()
        .master("yarn")
        .appName("Application")
        .enableHiveSupport()
        .getOrCreate();
Dataset<Row> ds = spark.read().parquet(filename);
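If you then need to stay on the core RDD API, the Dataset can be converted to a plain RDD (a usage sketch, continuing from the code above):

// Dataset#javaRDD gives a JavaRDD<Row> for core-API processing.
JavaRDD<Row> rows = ds.javaRDD();
rows.foreach(row -> System.out.println(row.mkString(",")));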