 

How do I write a PySpark DataFrame to HDFS and then read it back into a DataFrame?

I have a very big PySpark DataFrame. I want to preprocess subsets of it and store them on HDFS, then later read them all back and merge them together. Thanks.

Asked May 31 '17 by Ajg

People also ask

How do I write Pyspark DataFrame to CSV in HDFS?

In Spark, you can save (write) a DataFrame to a CSV file on disk by using dataframeObj.write.csv("path"). With the same API you can also write a DataFrame to AWS S3, Azure Blob Storage, HDFS, or any other Spark-supported file system.
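As a minimal sketch, assuming Spark 2.x (where DataFrameWriter.csv is available), an existing SparkSession named spark, and a placeholder output path:

    # Assumes an existing SparkSession named `spark`; data and output path are placeholders.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # Write the DataFrame as CSV files under the given HDFS directory.
    df.write.csv("hdfs:///data/out", header=True, mode="overwrite")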

How do I read a CSV file from HDFS in Pyspark?

You can read it easily with Spark using the csv method or by specifying format("csv"). In your case, either omit the hdfs:// prefix entirely or specify the complete path, e.g. hdfs://localhost:8020/input/housing.csv. Here is a snippet that reads a CSV.
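A minimal sketch, again assuming Spark 2.x with an existing SparkSession named spark; the host, port, and file path are placeholders:

    # Rely on the cluster's default filesystem for a relative HDFS path...
    df = spark.read.csv("/input/housing.csv", header=True, inferSchema=True)

    # ...or spell out the full HDFS URI (host and port here are placeholders).
    df = spark.read.format("csv").option("header", "true").load("hdfs://localhost:8020/input/housing.csv")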


1 Answer

  • Writing a DataFrame to HDFS (Spark 1.6):

    # df is an existing DataFrame object
    df.write.save('/target/path/', format='parquet', mode='append')
    

Some of the supported format options are csv, parquet, json, etc.

  • Reading a DataFrame from HDFS (Spark 1.6):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)  # sc is an existing SparkContext
    df = sqlContext.read.format('parquet').load('/path/to/file')
    

The format method takes arguments such as parquet, csv, json, etc. Since mode='append' adds files to the same directory, this also covers the question's subset workflow, as sketched below.
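A sketch of the full round trip under the same Spark 1.6 API: each processed subset is appended to one Parquet directory, and reading that directory later returns all subsets merged into a single DataFrame. The subsets iterable, the preprocess step, and the paths are placeholders, not part of the original answer:

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)  # sc is an existing SparkContext

    # Hypothetical preprocessing applied to each subset DataFrame.
    def preprocess(subset_df):
        return subset_df.dropna()

    # Append each processed subset to the same Parquet directory on HDFS.
    for subset_df in subsets:  # `subsets` is a placeholder iterable of DataFrames
        preprocess(subset_df).write.save('/target/path/', format='parquet', mode='append')

    # Loading the directory reads back all appended subsets as one merged DataFrame.
    merged = sqlContext.read.format('parquet').load('/target/path/')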

Answered Oct 06 '22 by rogue-one