I can't figure this out, though I suspect it's simple. I have a Spark DataFrame df with columns "A", "B" and "C". Now let's say I have an Array containing the names of the columns of this df:
column_names = Array("A","B","C")
I'd like to do a df.select() in such a way that I can specify which columns not to select. Example: let's say I do not want to select column "B". I tried
df.select(column_names.filter(_!="B"))
but this does not work, as
org.apache.spark.sql.DataFrame cannot be applied to (Array[String])
So, here it says it should work with a Seq instead. However, trying
df.select(column_names.filter(_!="B").toSeq)
results in
org.apache.spark.sql.DataFrame cannot be applied to (Seq[String]).
What am I doing wrong?
Since Spark 1.4 you can use the drop method:
Scala:
case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
df.drop("y")
Python:
df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
df.drop("y")
## DataFrame[x: bigint]
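As for why the select call in the question fails: select has no overload that takes a single Array or Seq of names. Its signatures are select(col: String, cols: String*) and select(cols: Column*), so the collection has to be expanded into varargs with : _*. A minimal sketch, assuming a spark-shell session where sqlContext is already in scope (as in the Scala example above); the case class Record and the variable names are made up for illustration:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical sample data with the columns from the question.
// Assumes `sqlContext` is provided by spark-shell.
case class Record(A: Int, B: Int, C: Int)
val df = sqlContext.createDataFrame(Record(1, 2, 3) :: Record(4, 5, 6) :: Nil)

val columnNames = Array("A", "B", "C")
val kept = columnNames.filter(_ != "B")  // Array("A", "C")

// Variant 1: select(col: String, cols: String*)
// The first name is passed on its own, the rest are expanded as varargs.
val byName = df.select(kept.head, kept.tail: _*)

// Variant 2: select(cols: Column*)
// Each name is turned into a Column, then the whole array is expanded.
val byColumn = df.select(kept.map(col): _*)
```

Variant 1 needs at least one remaining column, since kept.head throws on an empty array. If the goal is only to leave out a known column, drop is still the shorter route.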