
Usage of spark DataFrame "as" method

I am looking at the spark.sql.DataFrame documentation.

There is

def as(alias: String): DataFrame
    Returns a new DataFrame with an alias set.
    Since 1.3.0

What is the purpose of this method? How is it used? Can there be an example?

I have not found anything about this method online, and the documentation is pretty much nonexistent. I have not managed to create any kind of alias using it.

asked Jul 21 '15 by Prikso NAI


1 Answer

Spark <= 1.5

It is more or less equivalent to SQL table aliases:

SELECT *
FROM table AS alias;

Example usage adapted from PySpark alias documentation:

import org.apache.spark.sql.functions.col

case class Person(name: String, age: Int)

val df = sqlContext.createDataFrame(
    Person("Alice", 2) :: Person("Bob", 5) :: Nil)

// Alias the same DataFrame twice so the two sides of
// a self-join can be told apart.
val df_as1 = df.as("df1")
val df_as2 = df.as("df2")
val joined_df = df_as1.join(
    df_as2, col("df1.name") === col("df2.name"), "inner")

// Columns can now be referenced unambiguously through the aliases.
joined_df.select(
    col("df1.name"), col("df2.name"), col("df2.age")).show

Output:

+-----+-----+---+
| name| name|age|
+-----+-----+---+
|Alice|Alice|  2|
|  Bob|  Bob|  5|
+-----+-----+---+

Same thing using SQL query:

df.registerTempTable("df")
sqlContext.sql("""SELECT df1.name, df2.name, df2.age
                  FROM df AS df1 JOIN df AS df2
                  ON df1.name == df2.name""")

What is the purpose of this method?

Pretty much avoiding ambiguous column references.
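
To see the ambiguity an alias resolves, here is a minimal sketch (reusing the df defined above) of a self-join without aliases. Both sides expose a column literally named name, so a bare reference to it cannot be resolved; Spark may even warn that the join condition is trivially true, since both column references resolve to the same thing:

// Self-join without aliases: both sides carry a column named "name".
val joined = df.join(df, df("name") === df("name"))

// Selecting by bare name now fails with an AnalysisException along
// the lines of "Reference 'name' is ambiguous":
// joined.select(col("name"))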

Spark 1.6+

There is also a new as[U](implicit arg0: Encoder[U]): Dataset[U] method, which converts a DataFrame to a Dataset of a given type. For example:

df.as[Person]
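
As a minimal sketch (assuming the Person case class from above and a Spark 1.6 SQLContext), the required Encoder is supplied by the implicits import:

import sqlContext.implicits._  // provides Encoder[Person]

val ds: org.apache.spark.sql.Dataset[Person] = df.as[Person]
// Fields are now accessed in a typed, compile-checked way:
ds.filter(_.age > 3).show()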
answered Oct 05 '22 by zero323