In Spark, it is possible to set some Hadoop configuration settings via system properties, e.g.
System.setProperty("spark.hadoop.dfs.replication", "1")
This works: the replication factor is set to 1. Given that, I assumed the same pattern (prepending "spark.hadoop." to a regular Hadoop configuration property) would also work for textinputformat.record.delimiter:
System.setProperty("spark.hadoop.textinputformat.record.delimiter", "\n\n")
However, Spark seems to simply ignore this setting.
Am I setting textinputformat.record.delimiter in the correct way? Is there a simpler way of setting it? I would like to avoid writing my own InputFormat, since I really only need records delimited by two newlines.
I got this working with plain uncompressed files using the function below.
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
def nlFile(path: String) = {
  val conf = new Configuration
  // Two consecutive newlines: each blank-line-separated block becomes one record
  conf.set("textinputformat.record.delimiter", "\n\n")
  sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    .map(_._2.toString)  // keep only the record text, drop the byte-offset key
}