How to use double pipe as delimiter in CSV?

Spark 1.5 and Scala 2.10.6

I have a data file that uses "¦¦" as the delimiter, and I am having a hard time parsing it into a data frame. Can a multi-character delimiter be used to create a data frame? The code works with a single broken bar (¦) as the delimiter but not with the two-character "¦¦".

My Code:

val customSchema_1 = StructType(Array(
    StructField("ID", StringType, true), 
    StructField("FILLER", StringType, true), 
    StructField("CODE", StringType, true)));

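val df_1 = sqlContext.read
    .format("com.databricks.spark.csv")  // assumption: the spark-csv package, as Spark 1.5 has no built-in CSV reader
    .option("delimiter", "¦¦")           // this two-character delimiter is what fails
    .schema(customSchema_1)
    .load("example.txt")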

Sample file:

12345¦¦  ¦¦10
SFatima asked Dec 21 '16 17:12


3 Answers

I ran into this and found a good solution. I am using Spark 2.3; I suspect it works on all of Spark 2.2+, but I have not tested that. The approach is to replace each || with a tab and pass the result to the built-in CSV reader, which can take a Dataset[String]. I used a tab because my data contains commas.

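// "path/to/file" is a placeholder; substitute the real input path.
// Outside spark-shell, the map below also needs `import spark.implicits._`.
var df = spark.sqlContext.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .csv(
    spark.sqlContext.read.textFile("path/to/file")
      .map(line => line.split("\\|\\|").mkString("\t")))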

Hope this helps someone else.
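
The choice of tab above only works because a tab never occurs in the data. Here is a minimal sketch of that line-level rewrite in plain Scala (no Spark needed), assuming the delimiter is the literal `||`. The object and method names are hypothetical, and `String.replace` is swapped in for the split/mkString round trip because it treats `||` literally rather than as a regex:

```scala
object DelimiterRewrite {
  // Hypothetical helper, not part of the answer above.
  // Rewrites a "||"-delimited line into a tab-delimited one, refusing lines
  // that already contain a tab (they would gain extra columns).
  def toTabSeparated(line: String): String = {
    require(!line.contains("\t"), s"line already contains a tab: $line")
    line.replace("||", "\t")
  }

  def main(args: Array[String]): Unit = {
    // Print the rewritten form of a few sample lines.
    Seq("name||id", "foo||12", "brian||").map(toTabSeparated).foreach(println)
  }
}
```

In the Spark pipeline, a function like this could be passed to `.map(...)` in place of the split/mkString lambda. Like that lambda, it also rewrites a `||` that appears inside a quoted field.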


As of Spark 3.0.1 this works out of the box:


val ds = List("name||id", "foo||12", "brian||34", """"cray||name"||123""", "cray||name||123").toDS
ds: org.apache.spark.sql.Dataset[String] = [value: string]

val csv = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "||").csv(ds)
csv: org.apache.spark.sql.DataFrame = [name: string, id: string]

csv.show
+----------+----+
|      name|  id|
+----------+----+
|       foo|  12|
|     brian|  34|
|cray||name| 123|
|      cray|name|
+----------+----+
lockwobr answered Oct 31 '22 15:10


The actual error raised here is:

java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ¦¦

The docs corroborate this limitation, and I checked the Spark 2.0 CSV reader: it has the same requirement.

Given all of this, if your data is simple enough that no entry contains ¦¦, I would load it like so:

scala> :pa
// Entering paste mode (ctrl-D to finish)
val customSchema_1 = StructType(Array(
    StructField("ID", StringType, true), 
    StructField("FILLER", StringType, true), 
    StructField("CODE", StringType, true)));

// Exiting paste mode, now interpreting.
customSchema_1: org.apache.spark.sql.types.StructType = StructType(StructField(ID,StringType,true), StructField(FILLER,StringType,true), StructField(CODE,StringType,true))

scala> val rawData = sc.textFile("example.txt")
rawData: org.apache.spark.rdd.RDD[String] = example.txt MapPartitionsRDD[1] at textFile at <console>:31

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val rowRDD = rawData.map(line => Row.fromSeq(line.split("¦¦")))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:34

scala> val df = sqlContext.createDataFrame(rowRDD, customSchema_1)
df: org.apache.spark.sql.DataFrame = [ID: string, FILLER: string, CODE: string]

scala> df.show
+-----+------+----+
|   ID|FILLER|CODE|
+-----+------+----+
|12345|      |  10|
+-----+------+----+
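
One caveat with `line.split("¦¦")`: on the JVM, `String.split` without a limit drops trailing empty strings, so a line whose last field is empty yields fewer values than the schema has columns. Here is a minimal sketch in plain Scala, assuming the three-column schema above; `SplitFields`, `splitFields`, `hasExpectedArity`, and `expectedColumns` are hypothetical names introduced for illustration:

```scala
object SplitFields {
  // Number of columns in customSchema_1 (ID, FILLER, CODE).
  val expectedColumns = 3

  // Hypothetical helper: the -1 limit keeps trailing empty fields.
  def splitFields(line: String): Array[String] = line.split("¦¦", -1)

  // True when a line has exactly one value per schema column.
  def hasExpectedArity(line: String): Boolean =
    splitFields(line).length == expectedColumns

  def main(args: Array[String]): Unit = {
    val lines = Seq("12345¦¦  ¦¦10", "12345¦¦  ¦¦", "12345¦¦10")
    // Report the field count and arity check for each sample line.
    lines.foreach { l =>
      println(s"${splitFields(l).length} fields, ok=${hasExpectedArity(l)}")
    }
  }
}
```

In the pipeline above, filtering `rawData` with `hasExpectedArity` before building the rows would be one way to keep malformed lines out of the data frame.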
evan.oman answered Oct 31 '22 16:10


We read data with a custom delimiter and assigned custom column names to the data frame in the following way:

# Hold the new column names separately
headers = "JC_^!~_*>Year_^!~_*>Date_^!~_*>Service_Type^!~_*>KMs_Run^!~_*>"

# '^!~_*>' is the field delimiter, so split the header string on it
head = headers.split("^!~_*>")

# Split each line of the S3 file on the custom delimiter and convert the result into a DataFrame
df = sc.textFile("s3://S3_Path/sample.txt").map(lambda x: x.split("^!~_*>")).toDF(head)

Passing head to toDF() assigns the new column names to the DataFrame created from the custom-delimited text file.

Hope this helps.
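
For readers following the Scala answers: Python's `str.split` treats its argument literally, whereas Scala's `String.split` takes a regular expression, and `^!~_*>` contains the regex metacharacters `^` and `*`. Here is a minimal sketch of the same header split in plain Scala, quoting the delimiter with `java.util.regex.Pattern.quote`. The object and method names are hypothetical, and the header string is shortened for illustration:

```scala
import java.util.regex.Pattern

object CustomDelimiterSplit {
  val delimiter = "^!~_*>"

  // Quote the delimiter so every character is matched literally.
  def splitLiteral(line: String): Array[String] =
    line.split(Pattern.quote(delimiter))

  def main(args: Array[String]): Unit = {
    // Shortened, illustrative header string.
    val headers = "JC_^!~_*>Year_^!~_*>Date_"
    println(splitLiteral(headers).mkString(", "))
  }
}
```

Without the quoting, the pattern is anchored to the start of the input by `^` and never matches inside the line, so the split returns the whole line as a single field.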

AshishPatil answered Oct 31 '22 15:10
