How to get element by Index in Spark RDD (Java)

I know of the method rdd.first(), which gives me the first element of an RDD.

There is also the method rdd.take(num), which gives me the first "num" elements.
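
For example (a quick sketch of what I mean; assume rdd is a JavaRDD<String> over (a, b, c)):

String first = rdd.first();          // "a"
List<String> firstTwo = rdd.take(2); // ["a", "b"]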

But isn't there a way to get an element by its index?

Thanks.

progNewbie asked Nov 09 '14




1 Answer

This should be possible by first indexing the RDD. The transformation zipWithIndex provides a stable indexing, numbering each element in its original order.

Given: rdd = (a,b,c)

val withIndex = rdd.zipWithIndex // ((a,0),(b,1),(c,2)) 

This form is not useful for looking up an element by index, though. First we need to use the index as the key:

val indexKey = withIndex.map{case (k,v) => (v,k)}  //((0,a),(1,b),(2,c)) 

Now, it's possible to use the lookup action in PairRDD to find an element by key:

val b = indexKey.lookup(1) // Array(b) 

If you expect to use lookup often on the same RDD, I'd recommend caching the indexKey RDD to improve performance.

How to do this using the Java API is an exercise left for the reader; a possible translation is sketched below.
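
A minimal, self-contained sketch of the same steps in Java (the class and variable names are my own, and it assumes Spark's standard JavaRDD/JavaPairRDD API with a local master for testing):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class LookupByIndex {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("lookup-by-index").setMaster("local[*]"));

        JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a", "b", "c"));

        // Zip each element with its index: (a,0), (b,1), (c,2)
        JavaPairRDD<String, Long> withIndex = rdd.zipWithIndex();

        // Swap each pair so the index becomes the key: (0,a), (1,b), (2,c)
        JavaPairRDD<Long, String> indexKey =
                withIndex.mapToPair(t -> new Tuple2<>(t._2(), t._1()));

        // Cache if you expect to call lookup repeatedly on the same RDD
        indexKey.cache();

        List<String> result = indexKey.lookup(1L); // [b]
        System.out.println(result);

        sc.stop();
    }
}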

maasg answered Sep 29 '22