I am able to write it into ORC and PARQUET directly, and into TEXTFILE and AVRO using additional dependencies from Databricks:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.5.0</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.10</artifactId>
<version>2.0.1</version>
</dependency>
Sample code:
SparkContext sc = new SparkContext(conf);
HiveContext hc = new HiveContext(sc);
DataFrame df = hc.table(hiveTableName);
df.printSchema();
DataFrameWriter writer = df.repartition(1).write();
if ("ORC".equalsIgnoreCase(hdfsFileFormat)) {
writer.orc(outputHdfsFile);
} else if ("PARQUET".equalsIgnoreCase(hdfsFileFormat)) {
writer.parquet(outputHdfsFile);
} else if ("TEXTFILE".equalsIgnoreCase(hdfsFileFormat)) {
writer.format("com.databricks.spark.csv").option("header", "true").save(outputHdfsFile);
} else if ("AVRO".equalsIgnoreCase(hdfsFileFormat)) {
writer.format("com.databricks.spark.avro").save(outputHdfsFile);
}
Is there any way to write a DataFrame into a Hadoop SequenceFile or RCFile?
You can use void saveAsObjectFile(String path) to save an RDD as a SequenceFile of serialized objects. So in your case you have to retrieve the RDD from the DataFrame:

JavaRDD<Row> rdd = df.javaRDD();
rdd.saveAsObjectFile(outputHdfsFile);
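Note that saveAsObjectFile stores Java-serialized objects wrapped in a SequenceFile, so the output is mainly useful for reading back with Spark itself (e.g. JavaSparkContext.objectFile). If you need a SequenceFile of plain Hadoop Writables that other tools can read, a sketch along these lines should work (this is an assumed alternative, not from the answer above; the (NullWritable, Text) key/value choice and the comma-joined row string are just illustrative):

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Map each Row to a (key, value) pair of Writables.
// The lambda requires Java 8; otherwise use an anonymous PairFunction.
JavaPairRDD<NullWritable, Text> pairs = df.javaRDD().mapToPair(
        row -> new Tuple2<>(NullWritable.get(), new Text(row.mkString(","))));

// Write a standard Hadoop SequenceFile using the old mapred API output format.
pairs.saveAsHadoopFile(
        outputHdfsFile,                 // same target path variable as above
        NullWritable.class,
        Text.class,
        SequenceFileOutputFormat.class);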
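For RCFile I am not aware of a direct DataFrameWriter method in Spark 1.x. Since you already have a HiveContext, one possible workaround is to insert the DataFrame into a Hive table that is STORED AS RCFILE. A minimal sketch under that assumption (the table name and column list are placeholders and must match your DataFrame's schema):

// Hypothetical table name and columns; adjust to your schema.
hc.sql("CREATE TABLE IF NOT EXISTS my_rcfile_table (col1 STRING, col2 INT) STORED AS RCFILE");
df.write().insertInto("my_rcfile_table");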