 

How to write dataframe (obtained from hive table) into hadoop SequenceFile and RCFile?

I am able to write it into

  • ORC

  • PARQUET

directly, and into

  • TEXTFILE

  • AVRO

using additional dependencies from Databricks:

    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-csv_2.10</artifactId>
        <version>1.5.0</version>
    </dependency>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-avro_2.10</artifactId>
        <version>2.0.1</version>
    </dependency>

Sample code:

    SparkContext sc = new SparkContext(conf);
    HiveContext hc = new HiveContext(sc);
    DataFrame df = hc.table(hiveTableName);
    df.printSchema();
    DataFrameWriter writer = df.repartition(1).write();

    if ("ORC".equalsIgnoreCase(hdfsFileFormat)) {
        writer.orc(outputHdfsFile);

    } else if ("PARQUET".equalsIgnoreCase(hdfsFileFormat)) {
        writer.parquet(outputHdfsFile);

    } else if ("TEXTFILE".equalsIgnoreCase(hdfsFileFormat)) {
        writer.format("com.databricks.spark.csv").option("header", "true").save(outputHdfsFile);

    } else if ("AVRO".equalsIgnoreCase(hdfsFileFormat)) {
        writer.format("com.databricks.spark.avro").save(outputHdfsFile);
    }

Is there any way to write dataframe into hadoop SequenceFile and RCFile?

Dev asked Oct 03 '16


People also ask

Can we create Dataframe using tables in Hive?

DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

How do you load a Hive table into Pyspark Dataframe?

Spark provides the HiveContext class to access Hive tables directly in Spark. First, we need to import this class with the statement `from pyspark.sql import HiveContext`. Then we can use this class to create a context for Hive and read Hive tables into a Spark DataFrame.


1 Answer

You can use void saveAsObjectFile(String path) to save an RDD as a SequenceFile of serialized objects. So in your case you first have to retrieve the RDD from the DataFrame:

    JavaRDD<Row> rdd = df.javaRDD();
    rdd.saveAsObjectFile(outputHdfsFile);
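Note that saveAsObjectFile stores Java-serialized objects, so the resulting SequenceFile is really only readable back through SparkContext.objectFile. If you need a SequenceFile holding standard Writable key/value pairs that other Hadoop tools can read, a sketch along these lines should work (it reuses df and outputHdfsFile from the question; the choice of NullWritable keys and comma-joined rows as Text values is just one assumption, not the only option):

```java
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Turn each Row into a (key, value) pair of Writables; the key is unused here,
// and each row is serialized as a comma-separated line of its column values.
JavaPairRDD<NullWritable, Text> pairs = df.javaRDD().mapToPair(
        row -> new Tuple2<>(NullWritable.get(), new Text(row.mkString(","))));

// Write a genuine Hadoop SequenceFile via the new-API output format.
pairs.saveAsNewAPIHadoopFile(outputHdfsFile,
        NullWritable.class, Text.class, SequenceFileOutputFormat.class);
```

This snippet is a fragment in the same style as the question's sample code, so it assumes a surrounding method with df and outputHdfsFile in scope.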
nicoring answered Nov 30 '22
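For RCFile there is no dedicated DataFrameWriter method either. One hedged workaround is to go through HiveContext and let Hive itself do the writing with a CREATE TABLE ... AS SELECT statement; here df and hc are the names from the question, while "df_tmp" and "rc_output_table" are hypothetical names chosen for illustration:

```java
// Sketch: persist the DataFrame as RCFile by materializing it into a
// Hive table stored as RCFILE. "df_tmp" and "rc_output_table" are
// placeholder names, not part of the original question.
df.registerTempTable("df_tmp");
hc.sql("CREATE TABLE rc_output_table STORED AS RCFILE AS SELECT * FROM df_tmp");
```

The data then lands in the table's warehouse directory in RCFile format, managed by Hive rather than written to an arbitrary HDFS path.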