 

Scala-Spark (version 1.5.2) DataFrames split error

I have an input file foo.txt with the following content:

c1|c2|c3|c4|c5|c6|c7|c8|
00| |1.0|1.0|9|27.0|0||
01|2|3.0|4.0|1|10.0|1|1|

I want to transform it into a DataFrame to perform some SQL queries:

var text = sc.textFile("foo.txt")
var header = text.first()
var rdd = text.filter(row => row != header)
case class Data(c1: String, c2: String, c3: String, c4: String, c5: String, c6: String, c7: String, c8: String)

Up to this point everything is OK; the problem comes with the next statement:

var df = rdd.map(_.split("\\|")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()

If I try to print df with df.show, I get an error message:

scala> df.show()
java.lang.ArrayIndexOutOfBoundsException: 7

I suspect the error is caused by the split call. I also tried splitting foo.txt with the following syntax:

var df = rdd.map(_.split("""|""")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()

And then I get something like this:

scala> df.show()
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
|  c1  |     c2  |    c3    |     c4    |  c5 |     c6    |        c7      |       c8       |
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
|     0|        0|         ||           |    ||          1|               .|               0|
|     0|        1|         ||          2|    ||          3|               .|               0|
+------+---------+----------+-----------+-----+-----------+----------------+----------------+
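
As an aside on that second attempt: in a regex, a bare | is alternation between two empty patterns, so split("""|""") matches the empty string at every position and returns one character per element, which is exactly what the output above shows. A quick sketch of the difference (plain Scala, independent of Spark):

"00|1".split("""|""")   // Array(0, 0, |, 1) -- splits between every character
"00|1".split("\\|")     // Array(00, 1)      -- splits on the literal pipe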

Therefore, my question is: how can I correctly load this file into a DataFrame?

EDIT: The error comes from the first row, where the trailing || has no intermediate space. Depending on the example, this kind of field definition sometimes works fine and sometimes crashes.



2 Answers

This is because one of your lines is shorter than the others:

scala> var df = rdd.map(_.split("\\|")).map(_.length).collect()
df: Array[Int] = Array(7, 8)
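
The underlying reason is that Java's String.split (which Scala strings delegate to) drops trailing empty strings unless you pass a negative limit, so the trailing || of the first data row simply disappears. A quick sketch of the difference:

"00| |1.0|1.0|9|27.0|0||".split("\\|").length     // 7 -- trailing empty fields dropped
"00| |1.0|1.0|9|27.0|0||".split("\\|", -1).length // 9 -- all fields kept (8 values + trailing empty)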

You can pad the short rows yourself (but you need to handle each arity as a separate case):

val df = rdd.map(_.split("\\|")).map {
  case Array(a, b, c, d, e, f, g, h) => Data(a, b, c, d, e, f, g, h)
  case Array(a, b, c, d, e, f, g)    => Data(a, b, c, d, e, f, g, " ")  // pad the missing c8
}.toDF()  // toDF() is needed so that df.show() below works on a DataFrame

scala> df.show()
+---+---+---+---+---+----+---+---+
| c1| c2| c3| c4| c5|  c6| c7| c8|
+---+---+---+---+---+----+---+---+
| 00|   |1.0|1.0|  9|27.0|  0|   |
| 01|  2|3.0|4.0|  1|10.0|  1|  1|
+---+---+---+---+---+----+---+---+

EDIT:

A more generic solution would be something like this:

val df = rdd.map(_.split("\\|", -1)).map(_.slice(0,8)).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()

If you can assume that every line always has the right number of delimiters, it is safe to use this syntax and simply truncate the trailing empty value.
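
Since the stated goal is to run SQL queries over the result, the DataFrame can then be registered as a temporary table (registerTempTable is the Spark 1.5 API; the table name and the query below are only illustrative):

df.registerTempTable("foo")
sqlContext.sql("SELECT c1, c6 FROM foo WHERE c7 = '1'").show()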



My suggestion would be to use Databricks' spark-csv parser.

Link : https://github.com/databricks/spark-csv
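
The package is not bundled with Spark 1.x, so it has to be put on the classpath first, for example when starting the shell (the artifact version below is an assumption -- check the README above for the current coordinates):

spark-shell --packages com.databricks:spark-csv_2.10:1.5.0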

To load your example, I created a sample file similar to yours:

c1|c2|c3|c4|c5|c6|c7|c8|
00| |1.0|1.0|9|27.0|0||
01|2|3.0|4.0|1|10.0|1|1|

To create the DataFrame, use the code below:

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // use the first line of the file as the header
    .option("inferSchema", "true") // automatically infer data types
    .option("delimiter", "|") // default is ","
    .load("foo.txt")

  df.show()

I got the following output:

+---+---+---+---+---+----+---+----+---+
| c1| c2| c3| c4| c5|  c6| c7|  c8|   |
+---+---+---+---+---+----+---+----+---+
|  0|   |1.0|1.0|  9|27.0|  0|null|   |
|  1|  2|3.0|4.0|  1|10.0|  1|   1|   |
+---+---+---+---+---+----+---+----+---+

This way you do not have to bother with parsing the file yourself; you get a DataFrame directly.
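
One small wrinkle: the trailing | on every line makes spark-csv emit an extra, unnamed column (the blank/null column on the right above). If it is unwanted, one way to get rid of it is simply to select the eight named columns (a sketch, assuming the header names shown above):

val trimmed = df.select("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8")
trimmed.show()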
