
Spark: Reading files with a delimiter other than newline

Tags:

apache-spark

I'm using Apache Spark 1.0.1. I have many files delimited with the UTF-8 character \u0001 rather than the usual newline \n. How can I read such files in Spark? That is, the default record delimiter of sc.textFile("hdfs:///myproject/*") is \n, and I want to change it to \u0001.

asked Aug 12 '14 by dotan

People also ask

How do I read a text file into a Spark DataFrame?

The spark.read.text() method reads text files into a DataFrame. Like the RDD API, it can read several files at once, files matching a glob pattern, or every file in a directory, as sketched below.
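A minimal sketch, assuming the spark session handle available in spark-shell and placeholder HDFS paths (note that spark.read.text() requires a much newer Spark than the 1.0.1 in the question):

// Read one file, a glob pattern, or a whole directory into a DataFrame
val one  = spark.read.text("hdfs:///data/a.txt")
val glob = spark.read.text("hdfs:///data/log-*.txt")
val all  = spark.read.text("hdfs:///data/")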

How do I read multiple text files in spark RDD?

Spark Core provides the textFile() and wholeTextFiles() methods on SparkContext for reading one or more text or CSV files into a single RDD. Both methods also accept a directory or a glob pattern, as in the sketch below.
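A minimal sketch, assuming the sc SparkContext from spark-shell and placeholder paths:

// textFile: one record per line, across all files matching the pattern
val lines = sc.textFile("hdfs:///data/logs/*.txt")

// wholeTextFiles: one (filePath, fileContent) pair per file in the directory
val files = sc.wholeTextFiles("hdfs:///data/logs/")
files.map { case (path, content) => (path, content.length) }.collect()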

How do I read a multiple line CSV file in spark?

By default, Spark's CSV reader treats every physical line as a separate record, so a record that spans several lines gets split apart. Enable the multiLine option so that quoted fields containing embedded newlines are parsed as part of a single record; a multiline record without proper quoting or an escape character cannot be recovered this way. See the sketch below.
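A minimal sketch, assuming spark from spark-shell, a placeholder path, and a CSV file whose quoted fields contain embedded newlines (the multiLine option needs Spark 2.2 or later):

// Default read: each physical line becomes its own record
val broken = spark.read.option("header", "true").csv("hdfs:///data/people.csv")

// multiLine read: a quoted field spanning several lines stays in one record
val fixed = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .csv("hdfs:///data/people.csv")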

How to add comma delimiter in spark CSV file?

By default, the Spark CSV data source assumes records use a comma delimiter. If your files use another delimiter, such as the pipe character (|), set it with spark.read.option("delimiter", "|"), for example:
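A minimal sketch, again with spark from spark-shell and a placeholder path:

// Override the default comma delimiter to read a pipe-delimited file
val piped = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .csv("hdfs:///data/records.txt")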


1 Answer

You can use textinputformat.record.delimiter to set the record delimiter for TextInputFormat, e.g.:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Clone the existing Hadoop configuration and override the record delimiter
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "X")

// Each record arrives as an (offset, text) pair; keep only the text
val input = sc.newAPIHadoopFile("file_path", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], conf)
val lines = input.map { case (_, text) => text.toString }
println(lines.collect().mkString("Array(", ", ", ")"))

For example, if the input is a file containing the single line aXbXcXd, the above code will output

Array(a, b, c, d)
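For the \u0001-delimited files from the question, set the property to that character instead of "X":

// Use the question's UTF-8 \u0001 character as the record delimiter
conf.set("textinputformat.record.delimiter", "\u0001")
val records = sc.newAPIHadoopFile("hdfs:///myproject/*", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], conf).map { case (_, text) => text.toString }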
answered Oct 03 '22 by zsxwing