In Spark, it is possible to set some Hadoop configuration settings via system properties, e.g.
System.setProperty("spark.hadoop.dfs.replication", "1")
This works: the replication factor is set to 1. Given that, I assumed the same pattern (prepending "spark.hadoop." to a regular Hadoop configuration property) would also work for textinputformat.record.delimiter:
System.setProperty("spark.hadoop.textinputformat.record.delimiter", "\n\n")
However, Spark seems to simply ignore this setting.
Am I setting textinputformat.record.delimiter in the correct way? Is there a simpler way of setting it? I would like to avoid writing my own InputFormat, since I really only need records delimited by two newlines.
I got this working with plain uncompressed files using the function below.
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
def nlFile(path: String) = {
  val conf = new Configuration
  // Two consecutive newlines: each blank-line-separated block becomes one record
  conf.set("textinputformat.record.delimiter", "\n\n")
  sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    .map(_._2.toString)  // keep only the record text, drop the byte-offset key
}