Spark 1.5 and Scala 2.10.6
I have a data file that uses "¦¦" as the delimiter. I am having a hard time parsing it into a DataFrame. Can a multi-character delimiter be used to create a DataFrame? The code works with a single broken pipe but not with two.
My Code:
import org.apache.spark.sql.types._

val customSchema_1 = StructType(Array(
  StructField("ID", StringType, true),
  StructField("FILLER", StringType, true),
  StructField("CODE", StringType, true)))

val df_1 = sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(customSchema_1)
  .option("delimiter", "¦¦") // this two-character delimiter is what fails
  .load("example.txt")
Sample file:
12345¦¦ ¦¦10
I ran into this and found a good solution. I am using Spark 2.3; I have a feeling it should work on all of Spark 2.2+, but I have not tested it. The way it works is that I replace the || with a tab, and then the built-in CSV reader can take a Dataset[String]. I used tab because I have commas in my data.
// Read the file as plain text, replace every "||" with a tab, then let
// the built-in CSV reader parse the result as a Dataset[String].
import spark.implicits._

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .csv(spark.read.textFile("filename")
    .map(line => line.split("\\|\\|").mkString("\t")))
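If the data can also contain tabs, the same trick works with any character that is unlikely to appear in the input. A hedged variant using the control character \u0001 as the intermediate delimiter (assumption: \u0001 never occurs in your data):

// Same pre-split trick, but joining on \u0001 instead of tab.
val df2 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\u0001")
  .csv(spark.read.textFile("filename")
    .map(line => line.split("\\|\\|").mkString("\u0001")))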
Hope this helps someone else.
EDIT:
As of Spark 3.0.1, this works out of the box.
example:
val ds = List("name||id", "foo||12", "brian||34", """"cray||name"||123""", "cray||name||123").toDS
ds: org.apache.spark.sql.Dataset[String] = [value: string]
val csv = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "||").csv(ds)
csv: org.apache.spark.sql.DataFrame = [name: string, id: string]
csv.show
+----------+----+
| name| id|
+----------+----+
| foo| 12|
| brian| 34|
|cray||name| 123|
| cray|name|
+----------+----+
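Per the edit above, on Spark 3.0.1+ the built-in reader accepts a multi-character delimiter, and the same applies when loading from a path. A minimal sketch, reusing the schema and file name from the question:

import org.apache.spark.sql.types._

val customSchema_1 = StructType(Array(
  StructField("ID", StringType, true),
  StructField("FILLER", StringType, true),
  StructField("CODE", StringType, true)))

// On Spark 3.x "¦¦" is accepted even though it is two characters.
val df_1 = spark.read
  .schema(customSchema_1)
  .option("delimiter", "¦¦")
  .csv("example.txt")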
So the actual error being emitted here is:
java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ¦¦
The docs corroborate this limitation, and I checked the Spark 2.0 CSV reader; it has the same requirement.
Given all of this, if your data is simple enough that entries will never contain ¦¦, I would load it like so:
scala> :pa
// Entering paste mode (ctrl-D to finish)
val customSchema_1 = StructType(Array(
StructField("ID", StringType, true),
StructField("FILLER", StringType, true),
StructField("CODE", StringType, true)));
// Exiting paste mode, now interpreting.
customSchema_1: org.apache.spark.sql.types.StructType = StructType(StructField(ID,StringType,true), StructField(FILLER,StringType,true), StructField(CODE,StringType,true))
scala> val rawData = sc.textFile("example.txt")
rawData: org.apache.spark.rdd.RDD[String] = example.txt MapPartitionsRDD[1] at textFile at <console>:31
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rowRDD = rawData.map(line => Row.fromSeq(line.split("¦¦")))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:34
scala> val df = sqlContext.createDataFrame(rowRDD, customSchema_1)
df: org.apache.spark.sql.DataFrame = [ID: string, FILLER: string, CODE: string]
scala> df.show
+-----+------+----+
| ID|FILLER|CODE|
+-----+------+----+
|12345| | 10|
+-----+------+----+
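One caveat with the split-based approach: String.split drops trailing empty fields by default, so a line ending in ¦¦ would produce a Row with fewer values than the schema expects and fail when the DataFrame is evaluated. A hedged tweak, passing a limit of -1 so trailing empties are kept:

// limit -1 keeps trailing empty strings, so the Row arity always
// matches the three-column schema
val rowRDD = rawData.map(line => Row.fromSeq(line.split("¦¦", -1)))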
We read data with a custom delimiter and assigned custom column names to the DataFrame as follows:
# Hold the new column names separately
headers = "JC_^!~_*>Year_^!~_*>Date_^!~_*>Service_Type^!~_*>KMs_Run^!~_*>"
# '^!~_*>' is the field delimiter, so split the header string on it
head = headers.split("^!~_*>")
## The command below splits the S3 file on the custom delimiter and converts it into a DataFrame
df = sc.textFile("s3://S3_Path/sample.txt").map(lambda x: x.split("^!~_*>")).toDF(head)
Passing head as a parameter to toDF() assigns the new column names to the DataFrame created from the text file with custom delimiters.
Hope this helps.