Difference between Spark RDD's take(1) and first()

Tags: apache-spark, rdd, pyspark

I used to think that rdd.take(1) and rdd.first() were exactly the same. However, I began to wonder whether this is really true after a colleague pointed me to Spark's official documentation on RDD:

first(): Return the first element in this RDD.

take(num): Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit.

My questions are:

1. Is the underlying implementation of first() the same as take(1)?
2. Suppose rdd1 and rdd2 are constructed from the same CSV. Can I safely assume that rdd1.take(1) and rdd2.first() will always return the same result, i.e., the first row of the CSV? What if rdd1 and rdd2 are partitioned differently?

asked May 28 '16 04:05

Ida

1 Answer

In fact, first() is implemented in terms of take().

The following is taken from RDD.scala in Spark's source. first() calls take(1) and returns the single element if one is found; otherwise it throws an exception.

  def first(): T = withScope {
    take(1) match {
      case Array(t) => t
      case _ => throw new UnsupportedOperationException("empty collection")
    }
  }

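take(num) tries to collect num elements, starting from the RDD's 0th partition (using 0-based indexes). So take(1) and first() find the same element. The only differences are that take(1) returns it wrapped in a one-element array, while first() returns the element itself and throws on an empty RDD.

Even the Spark programming guide confirms this.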
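The relationship can be illustrated with a toy model in plain Python. This is not Spark's code: partitions are modeled as a list of lists, the names toy_take and toy_first are hypothetical, and the rule of quadrupling the number of partitions scanned per round is an assumption chosen for illustration rather than Spark's actual estimate.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def toy_take(partitions: Sequence[Sequence[T]], num: int) -> List[T]:
    """Collect up to `num` elements, scanning partitions in index order.

    Toy model only: reads partition 0 first and, if that is not enough,
    reads a larger batch of the following partitions on each round.
    """
    if num <= 0:
        return []
    result: List[T] = []
    scanned = 0  # number of partitions already read
    batch = 1    # the first round reads a single partition
    while len(result) < num and scanned < len(partitions):
        for part in partitions[scanned:scanned + batch]:
            remaining = num - len(result)
            result.extend(part[:remaining])
        scanned += batch
        batch *= 4  # hypothetical growth rule, for illustration only
    return result


def toy_first(partitions: Sequence[Sequence[T]]) -> T:
    """Mirror the structure of first(): call take(1) and unwrap the result."""
    taken = toy_take(partitions, 1)
    if len(taken) == 1:
        return taken[0]
    raise ValueError("empty collection")


# Partition 0 is empty here, so the scan moves on to partition 1.
parts = [[], ["a", "b"], ["c"]]
print(toy_take(parts, 1))  # ['a']
print(toy_first(parts))    # a
```

In this model both calls find the same element. They differ only in the return shape and in how an empty input is reported.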

About your second question: it depends on what you mean by "partitioned differently". If you call sc.textFile("/path/to/file") with or without the minPartitions argument, it does not matter, because the 0th partition always starts at the beginning of the file. So yes, in that case you can assume that both RDDs have the same first element.

EDIT: Partitions in an RDD are ordered, so the physical first line of your CSV ends up in the RDD's 0th partition. Both take(1) and first() return that first row of the 0th partition.
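The point about ordered partitions can be checked with a small plain-Python model. It assumes the file is split into contiguous, in-order chunks, which is the behavior described above for sc.textFile. The helper split_contiguous and the sample rows are hypothetical and not part of any Spark API.

```python
from typing import List


def split_contiguous(lines: List[str], num_partitions: int) -> List[List[str]]:
    """Split `lines` into `num_partitions` contiguous chunks, preserving order."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    size, extra = divmod(len(lines), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions receive one additional line each.
        end = start + size + (1 if i < extra else 0)
        partitions.append(lines[start:end])
        start = end
    return partitions


# Made-up sample data standing in for the lines of a CSV file.
csv_lines = ["id,name", "1,Ada", "2,Grace", "3,Edsger", "4,Barbara"]

rdd1_parts = split_contiguous(csv_lines, 2)  # stands in for rdd1
rdd2_parts = split_contiguous(csv_lines, 4)  # stands in for rdd2

# Partition 0 starts with the same physical first line in both layouts.
print(rdd1_parts[0][0])
print(rdd2_parts[0][0])
```

The model covers only order-preserving splits. It says nothing about RDDs whose partitioning was changed by a shuffle (for example via repartition), where element order is not guaranteed to be preserved.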

answered Sep 28 '22 05:09

Pranav Shukla
