 

How to name aggregate columns?

I'm using Spark in Scala and my aggregated columns are anonymous. Is there a convenient way to rename multiple columns in a Dataset? I thought about imposing a schema with as, but the key column is a struct (due to the groupBy operation), and I can't work out how to define a case class with a StructType in it.

I tried defining a schema as follows:

val returnSchema = StructType(StructField("edge", StructType(StructField("src", IntegerType, true),
                                                             StructField("dst", IntegerType), true)), 
                              StructField("count", LongType, true))
edge_count.as[returnSchema]

but I got a compile error:

Message: <console>:74: error: overloaded method value apply with alternatives:
  (fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
  (fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
  (fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
 cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, Boolean)
       val returnSchema = StructType(StructField("edge", StructType(StructField("src", IntegerType, true),
Emre asked Jul 25 '16 19:07




1 Answer

The best solution is to name your columns explicitly, e.g.,

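import org.apache.spark.sql.functions.expr
import spark.implicits._  // enables the 'a symbol-to-column syntax

df
  .groupBy('a, 'b)
  .agg(
    expr("count(*) as cnt"),   // alias written inside the SQL expression
    expr("sum(x) as x"),
    expr("sum(y)").as("y")     // alias applied with Column.as
  )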

If you are using a typed Dataset rather than a DataFrame, you also have to provide the type of each aggregated column, e.g., expr("count(*) as cnt").as[Long].
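As a rough sketch of the typed variant, the snippet below groups a Dataset by a case-class key, supplies the column type with as[Long], and then maps each result into a nested case class so the output columns get names. The Edge and EdgeCount classes, the sample rows, and the local SparkSession are all assumptions made for illustration; they are not taken from the question's actual data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Hypothetical types: a nested case class is encoded as a struct column.
case class Edge(src: Int, dst: Int)
case class EdgeCount(edge: Edge, count: Long)

// Assumed local session; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder().master("local[*]").appName("typed-agg-sketch").getOrCreate()
import spark.implicits._

// Made-up sample data.
val edges = Seq(Edge(1, 2), Edge(1, 2), Edge(2, 3)).toDS()

// Typed aggregation: the aggregate column must carry its type via as[Long].
val counted = edges
  .groupByKey(e => e)
  .agg(expr("count(*) as cnt").as[Long])   // Dataset[(Edge, Long)]

// Mapping into a case class names the columns after its fields.
val named = counted.map { case (edge, cnt) => EdgeCount(edge, cnt) }

named.printSchema()
```

Mapping into a case class is only one option; it costs an extra pass through the encoder, which may or may not matter for a given workload.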

You can use the DSL directly, but I often find it more verbose than simple SQL expressions.
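For comparison, here is a minimal sketch of the same aggregation written with the column DSL instead of SQL strings. The sample DataFrame and its values are invented for illustration; only the column names a, b, x and y follow the answer above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit, sum}

// Assumed local session; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder().master("local[*]").appName("dsl-agg-sketch").getOrCreate()
import spark.implicits._

// Made-up sample data with the column names used in the answer.
val df = Seq(
  ("k1", "u", 1, 10),
  ("k1", "u", 2, 20),
  ("k2", "v", 3, 30)
).toDF("a", "b", "x", "y")

// Each aggregate is aliased explicitly with Column.as.
val aggregated = df
  .groupBy($"a", $"b")
  .agg(
    count(lit(1)).as("cnt"),
    sum($"x").as("x"),
    sum($"y").as("y")
  )

aggregated.show()
```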

If you want to do mass renames, put the old and new names in a Map and foldLeft over it, starting from the DataFrame and renaming one column per step.
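A minimal sketch of that fold, assuming a DataFrame whose aggregate columns were left with their generated names. The sample data and the target names total_x and total_y are hypothetical. Note that withColumnRenamed silently does nothing when the old name is absent, so it is worth checking df.columns for the names your Spark version actually generates.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder().master("local[*]").appName("rename-sketch").getOrCreate()
import spark.implicits._

// Made-up sample data.
val df = Seq(("k1", 1, 10), ("k1", 2, 20), ("k2", 3, 30)).toDF("a", "x", "y")

// Unnamed aggregates: the columns come out as "sum(x)" and "sum(y)".
val anonymous = df.groupBy("a").sum("x", "y")

// Hypothetical old-name -> new-name mapping.
val renames = Map("sum(x)" -> "total_x", "sum(y)" -> "total_y")

// Thread the DataFrame through the map, renaming one column per step.
val renamed = renames.foldLeft(anonymous) {
  case (acc, (from, to)) => acc.withColumnRenamed(from, to)
}

renamed.printSchema()
```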

Sim answered Sep 17 '22 14:09
