
Apache Spark Dataset API: head(n: Int) vs take(n: Int)

The Apache Spark Dataset API has two methods, head(n: Int) and take(n: Int).

The Dataset.scala source contains:

def take(n: Int): Array[T] = head(n) 

I couldn't find any difference in the executed code between these two functions. Why does the API have two different methods that yield the same result?

Krishna Reddy asked Jul 17 '17 07:07



1 Answer

In my view, the reason is that the Apache Spark Dataset API tries to mimic the pandas DataFrame API, which has a head method: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html.
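To see the equivalence from the question in practice, here is a minimal sketch that calls both methods on a small Dataset and compares the results. The object name HeadVsTake, the local[1] master, and the sample data are assumptions made up for illustration; it also assumes spark-sql is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; the name and sample data are illustrative only.
object HeadVsTake {
  def main(args: Array[String]): Unit = {
    // Local single-threaded session, assumed here just for the demo.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("head-vs-take")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(1, 2, 3, 4, 5).toDS()

    // take(n) is defined as head(n), so both return an Array[T].
    val viaHead: Array[Int] = ds.head(3)
    val viaTake: Array[Int] = ds.take(3)
    assert(viaHead.sameElements(viaTake))

    // The no-argument variants follow the same pattern: first() is an alias for head().
    assert(ds.head() == ds.first())

    spark.stop()
  }
}
```

Since one method simply delegates to the other, choosing between them is a matter of naming preference rather than behavior or performance.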

Luis answered Sep 19 '22 18:09