I'm using Spark in Scala and my aggregated columns are anonymous. Is there a convenient way to rename multiple columns from a dataset? I thought about imposing a schema with as
but the key column is a struct (due to the groupBy
operation), and I can't find out how to define a case class
with a StructType
in it.
I tried defining a schema as follows:
val returnSchema = StructType(StructField("edge", StructType(StructField("src", IntegerType, true),
StructField("dst", IntegerType), true)),
StructField("count", LongType, true))
edge_count.as[returnSchema]
but I got a compile error:
Message: <console>:74: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, Boolean)
val returnSchema = StructType(StructField("edge", StructType(StructField("src", IntegerType, true),
Method 1: Using setNames() method The setNames() method is used to specify the name of an object and then return the object. In case of data frame, the columns can be renamed with new names, using the c() method.
You can use the AS keyword to rename the aggregate function column in the result set or any other field in the SELECT statement. You cannot use the AS keyword to rename a table in the FROM clause. When you use the AS keyword to rename the column, you must use this new name to refer to the column.
An Aggregate Column element adds a new column to the datalayer that represents an aggregated value of the data in one of the other columns in the datalayer.
To rename a column in R, you can use the rename() function from dplyr. For example, if you want to rename the column “A” to “B” again, you can run the following code: rename(dataframe, B = A) .
The best solution is to name your columns explicitly, e.g.,
df
.groupBy('a, 'b)
.agg(
expr("count(*) as cnt"),
expr("sum(x) as x"),
expr("sum(y)").as("y")
)
If you are using a dataset, you have to provide the type of your columns, e.g., expr("count(*) as cnt").as[Long]
.
You can use the DSL directly but I often find it to be more verbose than simple SQL expressions.
If you want to do mass renames, use a Map
and then foldLeft
the dataframe.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With