The Apache Spark Dataset API has two methods, head(n: Int) and take(n: Int). The Dataset.scala source contains:

def take(n: Int): Array[T] = head(n)

I couldn't find any difference in the execution code between these two functions. Why does the API have two different methods that yield the same result?
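The delegation can be illustrated with a tiny sketch. This is plain Python, not Spark source; the `MiniDataset` class and its local list of rows are illustrative assumptions, showing only how `take` is a pure alias for `head`, mirroring the one-line delegation in Dataset.scala:

```python
# Hypothetical sketch (plain Python, not Spark source): a minimal class
# showing take(n) delegating to head(n), as in Dataset.scala's
# `def take(n: Int): Array[T] = head(n)`.
class MiniDataset:
    def __init__(self, rows):
        self.rows = list(rows)

    def head(self, n):
        # In real Spark this launches a job with a limit;
        # here we simply slice a local list.
        return self.rows[:n]

    def take(self, n):
        # Same execution path as head(n); a second name is kept
        # so both API styles work.
        return self.head(n)

ds = MiniDataset([1, 2, 3, 4, 5])
print(ds.head(3))  # [1, 2, 3]
print(ds.take(3))  # [1, 2, 3]
```

Since `take` does nothing but call `head`, the two names are interchangeable; the duplication is an API-design choice, not a behavioral difference.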
Related reading: "A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets".
take(num: int) → List[T] — Take the first num elements of the RDD. It works by first scanning one partition, and then uses the results from that partition to estimate the number of additional partitions needed to satisfy the limit. (Translated from the Scala implementation in RDD#take().)
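The scan-and-estimate strategy described above can be simulated in plain Python. This is a rough sketch, not Spark's actual code: the partition layout, the 1.5x growth factor, and the doubling for empty partitions are illustrative assumptions, not Spark's exact constants.

```python
# Rough simulation (plain Python, no Spark) of the strategy RDD.take uses:
# scan one partition first, then use what came back to estimate how many
# more partitions to scan. The growth factors are illustrative assumptions.
def take_across_partitions(partitions, num):
    results = []
    parts_scanned = 0
    num_parts_to_try = 1          # start by scanning a single partition
    while len(results) < num and parts_scanned < len(partitions):
        for p in partitions[parts_scanned:parts_scanned + num_parts_to_try]:
            results.extend(p)
        parts_scanned += num_parts_to_try
        if len(results) < num and results:
            # Estimate remaining partitions from the average rows per
            # partition seen so far, with a 1.5x safety margin.
            per_part = len(results) / parts_scanned
            missing = num - len(results)
            num_parts_to_try = max(1, int(1.5 * missing / per_part))
        elif not results:
            num_parts_to_try *= 2  # only empty partitions so far: widen scan
    return results[:num]

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(take_across_partitions(parts, 5))  # [1, 2, 3, 4, 5]
```

The point of the estimate is to avoid launching a job over every partition when the first few already contain enough rows.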
collect() brings the entire Dataset back to the driver, so for a very large dataset, df.take(n) is the safer way to inspect the content and structure/metadata of a limited number of rows.
RDD – the low-level API; simple grouping and aggregation operations are slower because they bypass the query optimizer. DataFrame – very easy to use, and faster for exploratory analysis and for computing aggregated statistics on large data sets. Dataset – combines DataFrame-style performance on aggregations over large data sets with compile-time type safety.
In my view, the reason is that the Apache Spark Dataset API tries to mimic the Pandas DataFrame API, which also has a head method:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html