
How can I write a parquet file using Spark (pyspark)?

I'm pretty new to Spark and I've been trying to convert a DataFrame to a parquet file, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows me: AttributeError: 'RDD' object has no attribute 'write'

from pyspark import SparkContext

sc = SparkContext("local", "Protob Conversion to Parquet ")

# spark is an existing SparkSession
df = sc.textFile("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.write.parquet("/output/proto.parquet")

Do you know how to make this work?

The Spark version I'm using is 2.0.1, built for Hadoop 2.7.3.

asked Feb 03 '17 by ebertbm

People also ask

What is Parquet in PySpark?

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.
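For example, here is a minimal sketch (the file path and column values are made up for illustration) showing that the schema written out with a DataFrame is recovered when the parquet file is read back:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetSchemaDemo").getOrCreate()

# Hypothetical two-column DataFrame with an explicit schema
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# The column names and types are stored in the parquet file's metadata
df.write.parquet("/tmp/people.parquet")

# Reading the file back restores the same schema automatically
spark.read.parquet("/tmp/people.parquet").printSchema()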


1 Answer

The error was due to the fact that the textFile method from SparkContext returns an RDD, while what I needed was a DataFrame.
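(As an aside, the RDD itself can also be converted to a DataFrame before writing. A minimal sketch, assuming the CSV has two comma-separated columns, since the real column layout isn't shown in the question:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD to DataFrame").getOrCreate()

# textFile returns an RDD of strings, which has no .write attribute
rdd = spark.sparkContext.textFile("/temp/proto_temp.csv")

# Split each line and build a DataFrame with assumed column names
rows = rdd.map(lambda line: line.split(","))
df = spark.createDataFrame(rows, ["col1", "col2"])

df.write.parquet("/output/proto_from_rdd.parquet")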

SparkSession has a SQLContext under the hood. So I needed to use the DataFrameReader to read the CSV file correctly before converting it to a parquet file.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
df = spark.read.csv("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.show()

df.write.parquet("output/proto.parquet")
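To verify the result, the parquet output can be read straight back into a DataFrame (using the same path as above); the schema and data should come back unchanged:

# Read the parquet output back and check schema and contents
df_back = spark.read.parquet("output/proto.parquet")
df_back.printSchema()
df_back.show()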
answered Oct 04 '22 by ebertbm