
Spark: Reading files with a delimiter other than newline

Tags:

apache-spark

I'm using Apache Spark 1.0.1. I have many files delimited with the UTF-8 character \u0001 rather than the usual newline \n. How can I read such files in Spark? That is, the default record delimiter of sc.textFile("hdfs:///myproject/*") is \n, and I want to change it to \u0001.

asked Aug 12 '14 by dotan

People also ask

How do I read a text file into a Spark DataFrame?

The spark.read.text() method reads text files into a DataFrame. Like the RDD API, it can read several files at once, files matching a glob pattern, or every file in a directory, as sketched below.
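A minimal sketch, assuming the spark session handle available in spark-shell and placeholder HDFS paths (note that spark.read.text() requires a much newer Spark than the 1.0.1 in the question):

// Read one file, a glob pattern, or a whole directory into a DataFrame
val one  = spark.read.text("hdfs:///data/a.txt")
val glob = spark.read.text("hdfs:///data/log-*.txt")
val all  = spark.read.text("hdfs:///data/")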

How do I read multiple text files in spark RDD?

Spark Core provides the textFile() and wholeTextFiles() methods on SparkContext for reading one or more text or CSV files into a single RDD. Both methods also accept a directory or a glob pattern, as in the sketch below.
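A minimal sketch, assuming the sc SparkContext from spark-shell and placeholder paths:

// textFile: one record per line, across all files matching the pattern
val lines = sc.textFile("hdfs:///data/logs/*.txt")

// wholeTextFiles: one (filePath, fileContent) pair per file in the directory
val files = sc.wholeTextFiles("hdfs:///data/logs/")
files.map { case (path, content) => (path, content.length) }.collect()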

How do I read a multiple line CSV file in spark?

By default, Spark's CSV reader treats every physical line as a separate record, so a record that spans several lines gets split apart. Enable the multiLine option so that quoted fields containing embedded newlines are parsed as part of a single record; a multiline record without proper quoting or an escape character cannot be recovered this way. See the sketch below.
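A minimal sketch, assuming spark from spark-shell, a placeholder path, and a CSV file whose quoted fields contain embedded newlines (the multiLine option needs Spark 2.2 or later):

// Default read: each physical line becomes its own record
val broken = spark.read.option("header", "true").csv("hdfs:///data/people.csv")

// multiLine read: a quoted field spanning several lines stays in one record
val fixed = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .csv("hdfs:///data/people.csv")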

How to add comma delimiter in spark CSV file?

By default, the Spark CSV data source assumes records use a comma delimiter. If your files use another delimiter, such as the pipe character (|), set it with spark.read.option("delimiter", "|"), for example:
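A minimal sketch, again with spark from spark-shell and a placeholder path:

// Override the default comma delimiter to read a pipe-delimited file
val piped = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .csv("hdfs:///data/records.txt")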


1 Answer

You can use textinputformat.record.delimiter to set the record delimiter for TextInputFormat, e.g.:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Clone the existing Hadoop configuration and override the record delimiter
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "X")

// Each record arrives as an (offset, text) pair; keep only the text
val input = sc.newAPIHadoopFile("file_path", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], conf)
val lines = input.map { case (_, text) => text.toString }
println(lines.collect().mkString("Array(", ", ", ")"))

For example, if the input is a file containing the single line aXbXcXd, the above code will output

Array(a, b, c, d)
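For the \u0001-delimited files from the question, set the property to that character instead of "X":

// Use the question's UTF-8 \u0001 character as the record delimiter
conf.set("textinputformat.record.delimiter", "\u0001")
val records = sc.newAPIHadoopFile("hdfs:///myproject/*", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], conf).map { case (_, text) => text.toString }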
answered Oct 03 '22 by zsxwing