I have an input file foo.txt with the following content:
c1|c2|c3|c4|c5|c6|c7|c8|
00| |1.0|1.0|9|27.0|0||
01|2|3.0|4.0|1|10.0|1|1|
I want to transform it into a DataFrame to perform some SQL queries:
var text = sc.textFile("foo.txt")
var header = text.first()
var rdd = text.filter(row => row != header)
case class Data(c1: String, c2: String, c3: String, c4: String, c5: String, c6: String, c7: String, c8: String)
Up to this point everything is OK; the problem comes with the next statement:
var df = rdd.map(_.split("\\|")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
If I try to print df with df.show, I get an error message:
scala> df.show()
java.lang.ArrayIndexOutOfBoundsException: 7
I know that the error might be due to the split call. I also tried to split foo.txt using the following syntax:
var df = rdd.map(_.split("""|""")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
And then I get something like this:
scala> df.show()
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 |
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
| 0| 0| || | || 1| .| 0|
| 0| 1| || 2| || 3| .| 0|
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
Therefore, my question is: how can I correctly load this file into a DataFrame?
EDIT: The error occurs on the first row because it ends in ||, i.e. an empty field with no intermediate space. Depending on the example, this kind of field definition either works fine or crashes.
First, note that """|""" is not an escaped pipe: a bare | in a regex is alternation and matches the empty string, which is why your second attempt splits between every character. The ArrayIndexOutOfBoundsException, on the other hand, happens because one of your lines splits into fewer fields than the others:
scala> val lengths = rdd.map(_.split("\\|")).map(_.length).collect()
lengths: Array[Int] = Array(7, 8)
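The underlying reason is that Java's String.split drops trailing empty strings by default, so the empty fields produced by the trailing pipes of the first row simply vanish. A minimal sketch in the REPL, using only the standard library:
scala> "a|b||".split("\\|").length
res0: Int = 2
scala> "a|b||".split("\\|", -1).length
res1: Int = 4
Passing a negative limit preserves the trailing empty strings (more on this below).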
You can pad the short rows by pattern matching on the array length (but you need to handle each case explicitly):
val df = rdd.map(_.split("\\|")).map {
  case Array(a, b, c, d, e, f, g, h) => Data(a, b, c, d, e, f, g, h)
  case Array(a, b, c, d, e, f, g)    => Data(a, b, c, d, e, f, g, " ") // pad the missing c8
}.toDF()
scala> df.show()
+---+---+---+---+---+----+---+---+
| c1| c2| c3| c4| c5| c6| c7| c8|
+---+---+---+---+---+----+---+---+
| 00| |1.0|1.0| 9|27.0| 0| |
| 01| 2|3.0|4.0| 1|10.0| 1| 1|
+---+---+---+---+---+----+---+---+
EDIT:
A more generic solution would be something like this:
val df = rdd.map(_.split("\\|", -1)).map(_.slice(0,8)).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
If you assume that you always have the right number of delimiters, it is safe to use this syntax and truncate the last value.
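The -1 limit tells split to keep trailing empty strings, so each sample row (which ends with a delimiter) yields 9 fields, and slice(0,8) drops the last, empty one. A quick sanity check (a sketch against the sample file above):
scala> rdd.map(_.split("\\|", -1)).map(_.length).collect()
res2: Array[Int] = Array(9, 9)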
My suggestion would be to use Databricks' CSV parser.
Link : https://github.com/databricks/spark-csv
To load your example, I used a sample file similar to yours:
c1|c2|c3|c4|c5|c6|c7|c8|
00| |1.0|1.0|9|27.0|0||
01|2|3.0|4.0|1|10.0|1|1|
To create the DataFrame, use the code below:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // use first line of all files as header
  .option("inferSchema", "true") // automatically infer data types
  .option("delimiter", "|") // default is ","
  .load("foo.txt")
df.show()
I got the output below:
+---+---+---+---+---+----+---+----+---+
| c1| c2| c3| c4| c5| c6| c7| c8| |
+---+---+---+---+---+----+---+----+---+
| 0| |1.0|1.0| 9|27.0| 0|null| |
| 1| 2|3.0|4.0| 1|10.0| 1| 1| |
+---+---+---+---+---+----+---+----+---+
Note the extra unnamed column created by the trailing delimiter, and that inferSchema has read c1 as an integer, dropping the leading zeros. Still, this way you do not have to bother with parsing the file yourself; you get a DataFrame directly.
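If you are on Spark 2.0 or later, the CSV reader is built in and the external package is unnecessary. A minimal sketch, assuming Spark 2.x and the spark-shell's SparkSession named spark:
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "|")
  .csv("foo.txt")
df.show()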