 

How do I write a PySpark DataFrame to HDFS and then read it back into a DataFrame?

I have a very big PySpark DataFrame. I want to preprocess subsets of it and store them on HDFS, then later read them all back and merge them together. Thanks.

Asked May 31 '17 by Ajg

People also ask

How do I write Pyspark DataFrame to CSV in HDFS?

In Spark, you can save (write) a DataFrame to a CSV file on disk by using dataframeObj.write.csv("path"). With the same API you can also write a DataFrame to AWS S3, Azure Blob Storage, HDFS, or any other Spark-supported file system.
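As a minimal sketch, assuming Spark 2.x (where DataFrameWriter.csv is available), an existing SparkSession named spark, and a placeholder output path:

    # Assumes an existing SparkSession named `spark`; data and output path are placeholders.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # Write the DataFrame as CSV files under the given HDFS directory.
    df.write.csv("hdfs:///data/out", header=True, mode="overwrite")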

How do I read a CSV file from HDFS in Pyspark?

You can read it easily with Spark using the csv method or by specifying format("csv"). In your case, either omit the hdfs:// prefix entirely or specify the complete path, e.g. hdfs://localhost:8020/input/housing.csv. Here is a snippet that reads a CSV.
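A minimal sketch, again assuming Spark 2.x with an existing SparkSession named spark; the host, port, and file path are placeholders:

    # Rely on the cluster's default filesystem for a relative HDFS path...
    df = spark.read.csv("/input/housing.csv", header=True, inferSchema=True)

    # ...or spell out the full HDFS URI (host and port here are placeholders).
    df = spark.read.format("csv").option("header", "true").load("hdfs://localhost:8020/input/housing.csv")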


1 Answer

  • Writing a DataFrame to HDFS (Spark 1.6):

    # df is an existing DataFrame object
    df.write.save('/target/path/', format='parquet', mode='append')
    

Some of the supported format options are csv, parquet, json, etc.

  • Reading a DataFrame from HDFS (Spark 1.6):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)  # sc is an existing SparkContext
    df = sqlContext.read.format('parquet').load('/path/to/file')
    

The format method takes arguments such as parquet, csv, json, etc. Since mode='append' adds files to the same directory, this also covers the question's subset workflow, as sketched below.
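A sketch of the full round trip under the same Spark 1.6 API: each processed subset is appended to one Parquet directory, and reading that directory later returns all subsets merged into a single DataFrame. The subsets iterable, the preprocess step, and the paths are placeholders, not part of the original answer:

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)  # sc is an existing SparkContext

    # Hypothetical preprocessing applied to each subset DataFrame.
    def preprocess(subset_df):
        return subset_df.dropna()

    # Append each processed subset to the same Parquet directory on HDFS.
    for subset_df in subsets:  # `subsets` is a placeholder iterable of DataFrames
        preprocess(subset_df).write.save('/target/path/', format='parquet', mode='append')

    # Loading the directory reads back all appended subsets as one merged DataFrame.
    merged = sqlContext.read.format('parquet').load('/target/path/')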

Answered Oct 06 '22 by rogue-one