I want to read/write protocol buffer messages from/to HDFS with Apache Spark. I found these suggested ways:
1) Convert protobuf messages to JSON with Google's Gson library and then read/write them with Spark SQL. This solution is explained in this link, but I think converting to JSON is an extra step (see the sketch after this list).
2) Convert to Parquet files. There are the parquet-mr and sparksql-protobuf GitHub projects for this approach, but I don't want Parquet files because I always work with all columns (not just some), so the Parquet format gives me no gain (at least I think so).
3) ScalaPB. Maybe it's what I am looking for, but it's in Scala, a language I know nothing about; I am looking for a Java-based solution. This YouTube video introduces ScalaPB and explains how to use it (for Scala developers).
4) Through the use of sequence files. This is what I am looking for, but I found nothing about it. So, my question is: how can I write protobuf messages to a sequence file on HDFS, and read them back from it? Any other suggestion would be useful.
5) Through Twitter's Elephant Bird library.
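For option 1, here is a minimal sketch in Scala of the JSON route. Note it uses com.google.protobuf.util.JsonFormat (from the protobuf-java-util artifact) rather than Gson, since JsonFormat understands protobuf messages directly; the Order class and the paths are hypothetical placeholders for your own.

import com.google.protobuf.util.JsonFormat
import com.example.protos.Order  // hypothetical generated message class

// Suppose `messages` is an RDD[Order] obtained elsewhere.
val jsonRDD = messages.mapPartitions { iter =>
  // Build the printer per partition; it is not serializable.
  val printer = JsonFormat.printer().omittingInsignificantWhitespace()
  iter.map(m => printer.print(m))
}
jsonRDD.saveAsTextFile("/user/`whoami`/orders_json")

// Read the JSON back with Spark SQL; the schema is inferred from the data.
val df = spark.read.json("/user/`whoami`/orders_json")
df.printSchema()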
On reading a pb file into a PySpark DataFrame for distributed reading: you cannot do this natively with Spark, as it does not provide a reader for the pb format. You can read it as a text file into an RDD and use existing libraries, such as the one pointed out by @MarcGravell, to convert it to a JSON RDD, from which you can create your DataFrame.
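The final step described there, turning a JSON RDD into a DataFrame, is a one-liner in Scala (a sketch; jsonRDD stands for whatever RDD[String] of JSON documents your conversion produced, and the equivalent spark.read.json call exists in PySpark as well):

import spark.implicits._  // for the RDD[String] -> Dataset[String] conversion

// jsonRDD: RDD[String], one JSON document per element (hypothetical)
val df = spark.read.json(jsonRDD.toDS)
df.show()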
Protocol Buffers (Protobuf) is a free and open-source, cross-platform data format used to serialize structured data. It is useful for programs that communicate with each other over a network, and for storing data.
ScalaPB is a protocol buffer compiler (protoc) plugin for Scala. It generates Scala case classes, parsers, and serializers for your protocol buffers. The generated case classes can co-exist in the same project alongside the Java code generated for Protocol Buffers.
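For a feel of what ScalaPB generates, here is a sketch assuming a hypothetical message Person with fields name and age; serialization lives on the case class and parsing on the generated companion object:

// Hypothetical ScalaPB-generated case class for:
//   message Person { string name = 1; int32 age = 2; }
import com.example.protos.Person

val alice = Person(name = "Alice", age = 30)

// Serialize to the protobuf wire format...
val bytes: Array[Byte] = alice.toByteArray

// ...and parse it back via the generated companion object.
val roundTripped: Person = Person.parseFrom(bytes)
assert(roundTripped == alice)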
Though it's a bit hidden among the points, you seem to be asking how to write to a sequence file in Spark. I found an example here.
// Import the Hadoop Writable types
import org.apache.hadoop.io._
// We need data in sequence file format before we can read it, so let us see how to write one first.
// Read sample data from text files
val dataRDD = sc.textFile("/public/retail_db/orders")
// Use NullWritable as the key; the value will be saved as Text.
// For Int and String, Spark converts to IntWritable and Text automatically,
// but for other types we have to wrap values in a Writable ourselves.
// For example, if the key/value is of type Long, we have to
// wrap it explicitly with new LongWritable(value).
dataRDD.
  map(x => (NullWritable.get(), x)).
  saveAsSequenceFile("/user/`whoami`/orders_seq")
// Make sure to replace `whoami` with the appropriate OS user id
// Saving a sequence file with a key of type Int and a value of type String
// (note the different output path; saving to an existing directory fails)
dataRDD.
  map(x => (x.split(",")(0).toInt, x.split(",")(1))).
  saveAsSequenceFile("/user/`whoami`/orders_seq_int_key")
// Make sure to replace `whoami` with the appropriate OS user id
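To read a sequence file back, and to close the loop on the actual question (protobuf messages in a sequence file), here is a sketch. It wraps the serialized message bytes in BytesWritable; the Order message is again a hypothetical stand-in for your own generated class, so this is an outline under those assumptions rather than a definitive recipe.

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import com.example.protos.Order  // hypothetical generated message class

// Reading the Int/String sequence file back; Spark unwraps
// IntWritable/Text into Int/String for us.
val readBack = sc.sequenceFile[Int, String]("/user/`whoami`/orders_seq_int_key")
readBack.take(5).foreach(println)

// Writing protobuf messages: serialize each message and wrap the
// bytes in a BytesWritable. `messages` is a hypothetical RDD[Order].
messages.
  map(m => (NullWritable.get(), new BytesWritable(m.toByteArray))).
  saveAsSequenceFile("/user/`whoami`/orders_pb_seq")

// Reading protobuf messages back: copy the bytes out first, because
// Hadoop reuses the BytesWritable instance, then parse them.
val orders = sc.
  sequenceFile[NullWritable, BytesWritable]("/user/`whoami`/orders_pb_seq").
  map { case (_, bw) => Order.parseFrom(bw.copyBytes()) }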