 

How to read the output of show operator back to a Dataset?

Assume we have the following text file (the output of a df.show() command):

+----+---------+--------+
|col1|     col2|    col3|
+----+---------+--------+
|   1|pi number|3.141592|
|   2| e number| 2.71828|
+----+---------+--------+

Now I want to read/parse it back into a DataFrame/Dataset. What is the most "sparkling" way to do this?

P.S. I'm interested in solutions for both Scala and PySpark, which is why both tags are used.

asked Oct 21 '17 22:10 by MaxU - stop WAR against UA


1 Answer

UPDATE: using the "UNIVOCITY" parser lib, I could get rid of the one line that removed whitespace from the column names:

Scala:

// read Spark Output Fixed width table:
def readSparkOutput(filePath: String) : org.apache.spark.sql.DataFrame = {
    val t = spark.read
                 .option("header","true")
                 .option("inferSchema","true")
                 .option("delimiter","|")
                 .option("parserLib","UNIVOCITY")
                 .option("ignoreLeadingWhiteSpace","true")
                 .option("ignoreTrailingWhiteSpace","true")
                 .option("comment","+")
                 .csv(filePath)
    t.select(t.columns.filterNot(_.startsWith("_c")).map(t(_)):_*)
}

PySpark:

def read_spark_output(file_path):
    t = spark.read \
             .option("header","true") \
             .option("inferSchema","true") \
             .option("delimiter","|") \
             .option("parserLib","UNIVOCITY") \
             .option("ignoreLeadingWhiteSpace","true") \
             .option("ignoreTrailingWhiteSpace","true") \
             .option("comment","+") \
             .csv(file_path)
    # keep only the named columns (the empty edge columns are named _c0, _c4, ...)
    return t.select([c for c in t.columns if not c.startswith("_")])
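
To see what these reader options are doing, here is a minimal plain-Python sketch of the same steps, with no Spark involved. The function name parse_show_output and the inline sample text are illustrative assumptions, and unlike inferSchema this sketch leaves every value as a string:

```python
# Plain-Python sketch of the parsing steps; no Spark required.
SHOW_OUTPUT = """\
+----+---------+--------+
|col1|     col2|    col3|
+----+---------+--------+
|   1|pi number|3.141592|
|   2| e number| 2.71828|
+----+---------+--------+
"""


def parse_show_output(text):
    """Return (header, rows) parsed from the text printed by df.show()."""
    records = []
    for line in text.splitlines():
        # Like comment="+": skip the +----+ border lines (and blank lines).
        if not line.strip() or line.startswith("+"):
            continue
        # Like delimiter="|" plus the two ignore*WhiteSpace options.
        cells = [cell.strip() for cell in line.split("|")]
        # The leading and trailing "|" produce empty edge cells; these play
        # the role of the unnamed _c* columns filtered out in the Spark code.
        records.append(cells[1:-1])
    header, rows = records[0], records[1:]
    return header, rows


header, rows = parse_show_output(SHOW_OUTPUT)
print(header)
print(rows)
```

The edge cells are dropped by position here, whereas the Spark versions drop them by name, since Spark assigns names starting with _c to columns whose header is empty.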

Usage example:

scala> val df = readSparkOutput("file:///tmp/spark.out")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 1 more field]

scala> df.show
+----+---------+--------+
|col1|     col2|    col3|
+----+---------+--------+
|   1|pi number|3.141592|
|   2| e number| 2.71828|
+----+---------+--------+


scala> df.printSchema
root
 |-- col1: integer (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: double (nullable = true)

Old answer:

Here is my attempt in Scala (Spark 2.2):

// read Spark Output Fixed width table:
val t = spark.read
    .option("header","true")
    .option("inferSchema","true")
    .option("delimiter","|")
    .option("comment","+")
    .csv("file:///temp/spark.out")
// keep only the named columns (drop the empty _c* edge columns)
val cols = t.columns.filterNot(c => c.startsWith("_c")).map(a => t(a))
// strip the padding whitespace from the column names
val colsTrimmed = t.columns.filterNot(c => c.startsWith("_c")).map(c => c.replaceAll("\\s+",""))
// rename the columns using 'colsTrimmed'
val df = t.select(cols:_*).toDF(colsTrimmed:_*)

It works, but I have a feeling that there must be a much more elegant way to do this.

scala> df.show
+----+---------+--------+
|col1|     col2|    col3|
+----+---------+--------+
| 1.0|pi number|3.141592|
| 2.0| e number| 2.71828|
+----+---------+--------+

scala> df.printSchema
root
 |-- col1: double (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: double (nullable = true)
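
The rename step in the old answer is needed because, without the whitespace options, the header cells keep the padding that show() uses to right-align them. A small plain-Python sketch of that cleanup, where re.sub stands in for Scala's replaceAll (the variable names are illustrative, not part of the answer):

```python
import re

header_line = "|col1|     col2|    col3|"

# Splitting on "|" without trimming keeps the alignment padding in the names.
raw_names = header_line.split("|")[1:-1]
print(raw_names)

# Same idea as c.replaceAll("\\s+", "") in the Scala snippet.
clean_names = [re.sub(r"\s+", "", name) for name in raw_names]
print(clean_names)  # ['col1', 'col2', 'col3']
```

Note that removing all whitespace would also collapse a column name that legitimately contains spaces; stripping only the ends, as the whitespace options do, avoids that.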

answered Nov 11 '22 18:11 by MaxU - stop WAR against UA