The Apache Spark Dataset API has two methods, head(n: Int) and take(n: Int). The Dataset.scala source contains:

def take(n: Int): Array[T] = head(n)

I couldn't find any difference in the execution code between these two functions. Why does the API have two different methods that yield the same result?
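The delegation can be illustrated with a tiny sketch. This is plain Python, not Spark source; the `MiniDataset` class and its local list of rows are illustrative assumptions, showing only how `take` is a pure alias for `head`, mirroring the one-line delegation in Dataset.scala:

```python
# Hypothetical sketch (plain Python, not Spark source): a minimal class
# showing take(n) delegating to head(n), as in Dataset.scala's
# `def take(n: Int): Array[T] = head(n)`.
class MiniDataset:
    def __init__(self, rows):
        self.rows = list(rows)

    def head(self, n):
        # In real Spark this launches a job with a limit;
        # here we simply slice a local list.
        return self.rows[:n]

    def take(self, n):
        # Same execution path as head(n); a second name is kept
        # so both API styles work.
        return self.head(n)

ds = MiniDataset([1, 2, 3, 4, 5])
print(ds.head(3))  # [1, 2, 3]
print(ds.take(3))  # [1, 2, 3]
```

Since `take` does nothing but call `head`, the two names are interchangeable; the duplication is an API-design choice, not a behavioral difference.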
Related reading: "A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets".
take(num: int) → List[T] — Take the first num elements of the RDD. It works by first scanning one partition, and then uses the results from that partition to estimate the number of additional partitions needed to satisfy the limit. (Translated from the Scala implementation in RDD#take().)
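The scan-and-estimate strategy described above can be simulated in plain Python. This is a rough sketch, not Spark's actual code: the partition layout, the 1.5x growth factor, and the doubling for empty partitions are illustrative assumptions, not Spark's exact constants.

```python
# Rough simulation (plain Python, no Spark) of the strategy RDD.take uses:
# scan one partition first, then use what came back to estimate how many
# more partitions to scan. The growth factors are illustrative assumptions.
def take_across_partitions(partitions, num):
    results = []
    parts_scanned = 0
    num_parts_to_try = 1          # start by scanning a single partition
    while len(results) < num and parts_scanned < len(partitions):
        for p in partitions[parts_scanned:parts_scanned + num_parts_to_try]:
            results.extend(p)
        parts_scanned += num_parts_to_try
        if len(results) < num and results:
            # Estimate remaining partitions from the average rows per
            # partition seen so far, with a 1.5x safety margin.
            per_part = len(results) / parts_scanned
            missing = num - len(results)
            num_parts_to_try = max(1, int(1.5 * missing / per_part))
        elif not results:
            num_parts_to_try *= 2  # only empty partitions so far: widen scan
    return results[:num]

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(take_across_partitions(parts, 5))  # [1, 2, 3, 4, 5]
```

The point of the estimate is to avoid launching a job over every partition when the first few already contain enough rows.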
collect() brings the entire Dataset back to the driver, so for a very large dataset, df.take(n) is the safer way to inspect the content and structure/metadata of a limited number of rows.
RDD – the low-level API; simple grouping and aggregation operations are slower because they bypass the query optimizer. DataFrame – very easy to use, and faster for exploratory analysis and for computing aggregated statistics on large data sets. Dataset – combines DataFrame-style performance on aggregations over large data sets with compile-time type safety.
In my view, the reason is that the Apache Spark Dataset API tries to mimic the Pandas DataFrame API, which also has a head method:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html