 

Is it possible to alias columns programmatically in Spark SQL?

In Spark SQL (perhaps only in HiveQL) one can write:

select sex, avg(age) as avg_age from humans group by sex 

which would result in a DataFrame with columns named "sex" and "avg_age".

How can avg(age) be aliased to "avg_age" without using textual SQL?

Edit: After zero323's answer, I need to add the constraint that:

The name of the column to be renamed may not be known or guaranteed, or even addressable. In textual SQL, "select EXPR as NAME" removes the need for an intermediate name for EXPR. This is also the case in the example above, where "avg(age)" could get a variety of auto-generated names (which also vary among Spark releases and SQL-context backends).

Asked by Prikso NAI on Jul 21 '15.

People also ask

How do you alias a column name in PySpark?

To create an alias of a column, use the .alias() method. This method is the SQL equivalent of the AS keyword used to create aliases; it gives a temporary name to a column of the output PySpark DataFrame.

How do I specify a column alias in SQL?

The basic syntax of a table alias is: SELECT column1, column2 ... FROM table_name AS alias_name WHERE [condition]; The basic syntax of a column alias is: SELECT column_name AS alias_name FROM table_name;

What does alias do in Spark?

An alias of a PySpark DataFrame column changes the name of the column without changing its type or the data.


2 Answers

Let's suppose human_df is the DataFrame for humans. Since Spark 1.3:

import org.apache.spark.sql.functions.avg

human_df.groupBy("sex").agg(avg("age").alias("avg_age"))
Answered by Robert Chevallier on Sep 24 '22.

If you want to rename a single column you can use the withColumnRenamed method:

case class Person(name: String, age: Int)

val df = sqlContext.createDataFrame(
  Person("Alice", 2) :: Person("Bob", 5) :: Nil)

df.withColumnRenamed("name", "first_name")

Alternatively you can use the alias method:

import org.apache.spark.sql.functions.avg

df.select(avg($"age").alias("average_age"))

You can take it further with a small helper:

import org.apache.spark.sql.Column

def normalizeName(c: Column) = {
  val pattern = "\\W+".r
  c.alias(pattern.replaceAllIn(c.toString, "_"))
}

df.select(normalizeName(avg($"age")))
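The renaming logic inside normalizeName can be checked on a plain string, without a Spark session. Assuming the column's toString renders as "avg(age)" (the exact form varies across Spark versions, so treat it as an illustrative input), the regex collapses every run of non-word characters into a single underscore:

```scala
// Illustrative sketch: the same regex the helper uses, applied to a sample name.
// "avg(age)" here stands in for a hypothetical avg($"age").toString rendering.
val pattern = "\\W+".r
val normalized = pattern.replaceAllIn("avg(age)", "_")
println(normalized) // avg_age_
```

Note the trailing underscore left by the closing parenthesis; a stricter helper could trim leading and trailing underscores from the result.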
Answered by zero323 on Sep 23 '22.