I am looking at the spark.sql.DataFrame documentation.
There is
def as(alias: String): DataFrame
Returns a new DataFrame with an alias set.
Since 1.3.0
What is the purpose of this method? How is it used? Can there be an example?
I have not managed to find anything about this method online, and the documentation is pretty much non-existent. I have not managed to create any kind of alias using this method.
Spark <= 1.5
It is more or less equivalent to SQL table aliases:
SELECT *
FROM table AS alias;
Example usage adapted from the PySpark alias documentation:
import org.apache.spark.sql.functions.col

case class Person(name: String, age: Int)

val df = sqlContext.createDataFrame(
  Person("Alice", 2) :: Person("Bob", 5) :: Nil)

val df_as1 = df.as("df1")
val df_as2 = df.as("df2")

val joined_df = df_as1.join(
  df_as2, col("df1.name") === col("df2.name"), "inner")

joined_df.select(
  col("df1.name"), col("df2.name"), col("df2.age")).show
Output:
+-----+-----+---+
| name| name|age|
+-----+-----+---+
|Alice|Alice| 2|
| Bob| Bob| 5|
+-----+-----+---+
The same thing using a SQL query:
df.registerTempTable("df")
sqlContext.sql("""SELECT df1.name, df2.name, df2.age
                  FROM df AS df1 JOIN df AS df2
                  ON df1.name == df2.name""")
What is the purpose of this method?
Pretty much avoiding ambiguous column references, as in the self-join above where both sides come from the same table.
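For contrast, here is a minimal sketch of the same self-join without aliases, reusing the df defined above. Both sides of the join expose a column called name, so the unqualified reference cannot be resolved and analysis fails (error message paraphrased):

// Without aliases, col("name") could refer to either side of the
// self-join, so Spark rejects it with an AnalysisException along
// the lines of "Reference 'name' is ambiguous".
val broken = df.join(df, col("name") === col("name"), "inner")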
Spark 1.6+
There is also a new as[U](implicit arg0: Encoder[U]): Dataset[U], which converts a DataFrame to a Dataset of a given type. For example:
df.as[Person]
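Note that as[U] needs an implicit Encoder[U] in scope; for case classes like Person above, this is typically supplied by the implicits import. A minimal sketch using Spark 1.6 syntax:

import sqlContext.implicits._

val ds = df.as[Person]        // Dataset[Person]
ds.filter(_.age > 3).show()   // typed field access, checked at compile time

In Spark 2.x the same works with import spark.implicits._ on a SparkSession.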