Scala 2.10 here, using Spark 1.6.2. I have a question similar to (but not the same as) this one; however, the accepted answer there is not an SSCCE and assumes a certain amount of "upfront knowledge" about Spark, so I can't reproduce it or make sense of it. More importantly, that question is limited to adding a new column to an existing dataframe, whereas I need to add a column as well as a value for all existing rows in the dataframe.
So I want to add a column to an existing Spark DataFrame, and then apply an initial ('default') value for that new column to all rows.
// Running in the Spark 1.6 shell, where sparkContext and sqlContext are predefined
val json: String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
jsonDF.show()
When I run that, I get the following output (via .show()):
+----+--------+
| x| y|
+----+--------+
|true|not true|
+----+--------+
Now I want to add a new field to jsonDF after it's been created, and without modifying the json string, such that the resulting DF would look like this:
+----+--------+----+
| x| y| z|
+----+--------+----+
|true|not true| red|
+----+--------+----+
Meaning, I want to add a new "z" column to the DF, of type StringType, and then default all rows to contain a z-value of "red".
From that other question, I have pieced together the following pseudo-code:
val json: String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
//jsonDF.show()
val newDF = jsonDF.withColumn("z", jsonDF("col") + 1) // this is the line that fails
newDF.show()
But when I run this, I get a runtime error on that .withColumn(...) call:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "col" among (x, y);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:652)
I also don't see any API methods that would allow me to set "red" as the default value. Any ideas as to where I'm going awry?
You can use the lit function. The AnalysisException in your attempt arises because jsonDF("col") refers to an existing column named "col", which your DataFrame doesn't have (its columns are x and y); to supply a constant value for every row you need a literal Column instead, which is what lit produces. First you have to import it:

import org.apache.spark.sql.functions.lit

and use it as shown below:
jsonDF.withColumn("z", lit("red"))
The type of the new column will be inferred automatically (here StringType, since "red" is a string literal).
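For completeness, here is a minimal end-to-end sketch of the fix, again assuming a Spark 1.6 shell where sparkContext and sqlContext are predefined, as in the question:

import org.apache.spark.sql.functions.lit

val json: String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)

// lit("red") wraps the constant in a Column, so withColumn assigns "red"
// to the new "z" column for every existing row
val newDF = jsonDF.withColumn("z", lit("red"))
newDF.show()

which prints:

+----+--------+---+
|   x|       y|  z|
+----+--------+---+
|true|not true|red|
+----+--------+---+

If you ever need the default to be a typed null rather than a concrete value, the same pattern works with an explicit cast, e.g. lit(null).cast(StringType), after importing org.apache.spark.sql.types.StringType.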