I am able to write it into ORC and PARQUET directly, and into TEXTFILE and AVRO using additional dependencies from Databricks:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.5.0</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.10</artifactId>
<version>2.0.1</version>
</dependency>
Sample code:
SparkContext sc = new SparkContext(conf);
HiveContext hc = new HiveContext(sc);
DataFrame df = hc.table(hiveTableName);
df.printSchema();
DataFrameWriter writer = df.repartition(1).write();
if ("ORC".equalsIgnoreCase(hdfsFileFormat)) {
writer.orc(outputHdfsFile);
} else if ("PARQUET".equalsIgnoreCase(hdfsFileFormat)) {
writer.parquet(outputHdfsFile);
} else if ("TEXTFILE".equalsIgnoreCase(hdfsFileFormat)) {
writer.format("com.databricks.spark.csv").option("header", "true").save(outputHdfsFile);
} else if ("AVRO".equalsIgnoreCase(hdfsFileFormat)) {
writer.format("com.databricks.spark.avro").save(outputHdfsFile);
}
Is there any way to write a DataFrame into a Hadoop SequenceFile or RCFile?
You can use void saveAsObjectFile(String path) to save an RDD as a SequenceFile of serialized objects. So in your case you have to retrieve the RDD from the DataFrame:

JavaRDD<Row> rdd = df.javaRDD();
rdd.saveAsObjectFile(outputHdfsFile);
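Note that saveAsObjectFile stores Java-serialized objects wrapped in a SequenceFile, so the output is mainly useful for reading back with Spark itself (e.g. JavaSparkContext.objectFile). If you need a SequenceFile of plain Hadoop Writables that other tools can read, a sketch along these lines should work (this is an assumed alternative, not from the answer above; the (NullWritable, Text) key/value choice and the comma-joined row string are just illustrative):

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Map each Row to a (key, value) pair of Writables.
// The lambda requires Java 8; otherwise use an anonymous PairFunction.
JavaPairRDD<NullWritable, Text> pairs = df.javaRDD().mapToPair(
        row -> new Tuple2<>(NullWritable.get(), new Text(row.mkString(","))));

// Write a standard Hadoop SequenceFile using the old mapred API output format.
pairs.saveAsHadoopFile(
        outputHdfsFile,                 // same target path variable as above
        NullWritable.class,
        Text.class,
        SequenceFileOutputFormat.class);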
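For RCFile I am not aware of a direct DataFrameWriter method in Spark 1.x. Since you already have a HiveContext, one possible workaround is to insert the DataFrame into a Hive table that is STORED AS RCFILE. A minimal sketch under that assumption (the table name and column list are placeholders and must match your DataFrame's schema):

// Hypothetical table name and columns; adjust to your schema.
hc.sql("CREATE TABLE IF NOT EXISTS my_rcfile_table (col1 STRING, col2 INT) STORED AS RCFILE");
df.write().insertInto("my_rcfile_table");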