
How can I write a parquet file using Spark (pyspark)?

I'm pretty new to Spark and I've been trying to convert a DataFrame to a parquet file, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows me: AttributeError: 'RDD' object has no attribute 'write'

from pyspark import SparkContext

sc = SparkContext("local", "Protob Conversion to Parquet ")

# spark is an existing SparkSession
df = sc.textFile("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.write.parquet("/output/proto.parquet")

Do you know how to make this work?

The Spark version I'm using is 2.0.1, built for Hadoop 2.7.3.

asked Feb 03 '17 by ebertbm

People also ask

What is Parquet in PySpark?

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.
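For example, here is a minimal sketch (the file path and column values are made up for illustration) showing that the schema written out with a DataFrame is recovered when the parquet file is read back:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetSchemaDemo").getOrCreate()

# Hypothetical two-column DataFrame with an explicit schema
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# The column names and types are stored in the parquet file's metadata
df.write.parquet("/tmp/people.parquet")

# Reading the file back restores the same schema automatically
spark.read.parquet("/tmp/people.parquet").printSchema()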


1 Answer

The error was due to the fact that the textFile method from SparkContext returns an RDD, while what I needed was a DataFrame.
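(As an aside, the RDD itself can also be converted to a DataFrame before writing. A minimal sketch, assuming the CSV has two comma-separated columns, since the real column layout isn't shown in the question:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD to DataFrame").getOrCreate()

# textFile returns an RDD of strings, which has no .write attribute
rdd = spark.sparkContext.textFile("/temp/proto_temp.csv")

# Split each line and build a DataFrame with assumed column names
rows = rdd.map(lambda line: line.split(","))
df = spark.createDataFrame(rows, ["col1", "col2"])

df.write.parquet("/output/proto_from_rdd.parquet")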

SparkSession has a SQLContext under the hood. So I needed to use the DataFrameReader to read the CSV file correctly before converting it to a parquet file.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
df = spark.read.csv("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.show()

df.write.parquet("output/proto.parquet")
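To verify the result, the parquet output can be read straight back into a DataFrame (using the same path as above); the schema and data should come back unchanged:

# Read the parquet output back and check schema and contents
df_back = spark.read.parquet("output/proto.parquet")
df_back.printSchema()
df_back.show()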
answered Oct 04 '22 by ebertbm