Spark 1.5 and Scala 2.10.6
I have a data file that uses "¦¦" as the delimiter. I am having a hard time parsing it into a DataFrame. Can a multi-character delimiter be used to create a DataFrame? The code works with a single broken pipe but not with two.
My Code:
import org.apache.spark.sql.types._

val customSchema_1 = StructType(Array(
  StructField("ID", StringType, true),
  StructField("FILLER", StringType, true),
  StructField("CODE", StringType, true)))

val df_1 = sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(customSchema_1)
  .option("delimiter", "¦¦") // this two-character delimiter is what fails
  .load("example.txt")
Sample file:
12345¦¦ ¦¦10
I ran into this and found a good solution. I am using Spark 2.3; I have a feeling it should work on all of Spark 2.2+, but I have not tested it. The way it works is that I replace the || with a tab, and then the built-in CSV reader can take a Dataset[String]. I used tab because I have commas in my data.
// Read the file as plain text, replace every "||" with a tab, then let
// the built-in CSV reader parse the result as a Dataset[String].
import spark.implicits._

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .csv(spark.read.textFile("filename")
    .map(line => line.split("\\|\\|").mkString("\t")))
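If the data can also contain tabs, the same trick works with any character that is unlikely to appear in the input. A hedged variant using the control character \u0001 as the intermediate delimiter (assumption: \u0001 never occurs in your data):

// Same pre-split trick, but joining on \u0001 instead of tab.
val df2 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\u0001")
  .csv(spark.read.textFile("filename")
    .map(line => line.split("\\|\\|").mkString("\u0001")))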
Hope this helps someone else.
EDIT:
As of Spark 3.0.1, this works out of the box.
example:
val ds = List("name||id", "foo||12", "brian||34", """"cray||name"||123""", "cray||name||123").toDS
ds: org.apache.spark.sql.Dataset[String] = [value: string]
val csv = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "||").csv(ds)
csv: org.apache.spark.sql.DataFrame = [name: string, id: string]
csv.show
+----------+----+
| name| id|
+----------+----+
| foo| 12|
| brian| 34|
|cray||name| 123|
| cray|name|
+----------+----+
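Per the edit above, on Spark 3.0.1+ the built-in reader accepts a multi-character delimiter, and the same applies when loading from a path. A minimal sketch, reusing the schema and file name from the question:

import org.apache.spark.sql.types._

val customSchema_1 = StructType(Array(
  StructField("ID", StringType, true),
  StructField("FILLER", StringType, true),
  StructField("CODE", StringType, true)))

// On Spark 3.x "¦¦" is accepted even though it is two characters.
val df_1 = spark.read
  .schema(customSchema_1)
  .option("delimiter", "¦¦")
  .csv("example.txt")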
So the actual error being emitted here is:
java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ¦¦
The docs corroborate this limitation, and I checked the Spark 2.0 CSV reader; it has the same requirement.
Given all of this, if your data is simple enough that entries will never contain ¦¦, I would load it like so:
scala> :pa
// Entering paste mode (ctrl-D to finish)
val customSchema_1 = StructType(Array(
StructField("ID", StringType, true),
StructField("FILLER", StringType, true),
StructField("CODE", StringType, true)));
// Exiting paste mode, now interpreting.
customSchema_1: org.apache.spark.sql.types.StructType = StructType(StructField(ID,StringType,true), StructField(FILLER,StringType,true), StructField(CODE,StringType,true))
scala> val rawData = sc.textFile("example.txt")
rawData: org.apache.spark.rdd.RDD[String] = example.txt MapPartitionsRDD[1] at textFile at <console>:31
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rowRDD = rawData.map(line => Row.fromSeq(line.split("¦¦")))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:34
scala> val df = sqlContext.createDataFrame(rowRDD, customSchema_1)
df: org.apache.spark.sql.DataFrame = [ID: string, FILLER: string, CODE: string]
scala> df.show
+-----+------+----+
| ID|FILLER|CODE|
+-----+------+----+
|12345| | 10|
+-----+------+----+
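One caveat with the split-based approach: String.split drops trailing empty fields by default, so a line ending in ¦¦ would produce a Row with fewer values than the schema expects and fail when the DataFrame is evaluated. A hedged tweak, passing a limit of -1 so trailing empties are kept:

// limit -1 keeps trailing empty strings, so the Row arity always
// matches the three-column schema
val rowRDD = rawData.map(line => Row.fromSeq(line.split("¦¦", -1)))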
We read data with a custom delimiter and assigned custom column names to the DataFrame as follows:
# Hold the new column names separately
headers = "JC_^!~_*>Year_^!~_*>Date_^!~_*>Service_Type^!~_*>KMs_Run^!~_*>"
# '^!~_*>' is the field delimiter, so split the header string on it
head = headers.split("^!~_*>")
## The command below splits the S3 file on the custom delimiter and converts it into a DataFrame
df = sc.textFile("s3://S3_Path/sample.txt").map(lambda x: x.split("^!~_*>")).toDF(head)
Passing head as a parameter to toDF() assigns the new column names to the DataFrame created from the text file with custom delimiters.
Hope this helps.