I have a very big PySpark DataFrame. I want to perform preprocessing on subsets of it and store them to HDFS, then later read all of them back and merge them together. Thanks.
In Spark, you can save (write) a DataFrame to a CSV file on disk by using dataframeObj.write.csv("path"). Using this, you can also write the DataFrame to AWS S3, Azure Blob Storage, HDFS, or any other Spark-supported file system.
You can read it back easily with Spark using the csv method or by specifying format("csv"). In your case, either do not specify hdfs:// at all, or specify the complete path, e.g. hdfs://localhost:8020/input/housing.csv. Here is a snippet of code that can read a CSV file.
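A minimal sketch of that read, assuming Spark 2.x where spark is the SparkSession and reusing the hdfs://localhost:8020/input/housing.csv path from above (the header and inferSchema options are illustrative); on Spark 1.6 you would go through sqlContext with the spark-csv package instead:

# Minimal sketch: read a CSV file from HDFS into a DataFrame.
# Adjust header/inferSchema to match your file.
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("hdfs://localhost:8020/input/housing.csv")
df.show(5)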
Writing a DataFrame to HDFS (Spark 1.6):
df.write.save('/target/path/', format='parquet', mode='append')  # df is an existing DataFrame object
Some of the format options are csv, parquet, json, etc.
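Since the original question is about storing preprocessed subsets, here is a minimal sketch of that write step; preprocess, big_df, subset_keys, the 'group' column, and the target path are all placeholders for your own code:

# Hypothetical example: write each preprocessed subset to the same HDFS
# directory in append mode, so the parts accumulate under one path.
for key in subset_keys:
    subset_df = big_df.filter(big_df['group'] == key)   # pick one subset
    cleaned_df = preprocess(subset_df)                   # your own preprocessing
    cleaned_df.write.save('/target/path/', format='parquet', mode='append')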
Reading a DataFrame from HDFS (Spark 1.6):
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc is an existing SparkContext
df = sqlContext.read.format('parquet').load('/path/to/file')
The format method takes arguments such as parquet, csv, json, etc.
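To close the loop on the original question (reading all stored subsets and merging them), a minimal sketch with illustrative paths:

# Read everything written to /target/path/ back as one DataFrame.
merged_df = sqlContext.read.format('parquet').load('/target/path/')

# If the subsets were instead written to separate directories, read and union them
# (unionAll is the Spark 1.6 name; in Spark 2.x it is union):
part1 = sqlContext.read.format('parquet').load('/target/path_subset1/')
part2 = sqlContext.read.format('parquet').load('/target/path_subset2/')
merged_df = part1.unionAll(part2)

Because load() on a directory picks up every part file underneath it, a single read of /target/path/ already gives you the merged DataFrame when all subsets were appended there; the union is only needed when the subsets live in separate directories.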