How can I select a non-sequential subset of elements from an array using Scala and Spark?

In Python (with NumPy), this is how I would do it:

>>> x
array([10,  9,  8,  7,  6,  5,  4,  3,  2])
>>> x[np.array([3, 3, 1, 8])]
array([7, 7, 9, 2])

This doesn't work in the Scala Spark shell:

scala> val indices = Array(3,2,0)
indices: Array[Int] = Array(3, 2, 0)

scala> val A = Array(10,11,12,13,14,15)
A: Array[Int] = Array(10, 11, 12, 13, 14, 15)

scala> A(indices)
<console>:28: error: type mismatch;
 found   : Array[Int]
 required: Int
              A(indices)

The foreach method doesn't work either:

scala> indices.foreach(println(_))
3
2
0

scala> indices.foreach(A(_))
<no output>

What I want is the result of B:

scala> val B = Array(A(3),A(2),A(0))
B: Array[Int] = Array(13, 12, 10)

However, I don't want to hard-code it like that, because I don't know how long indices will be or what it will contain.

Climbs_lika_Spyder asked Apr 02 '15

People also ask

How do I get a subset of an array in Spark?

Spark SQL provides a slice() function to get a subset or range of elements from an array (subarray) column of a DataFrame; slice is part of the Spark SQL array functions group.

How do I get a subset of elements from an array column?

Now, let's use the slice() SQL function to slice the array and get the subset of elements from an array column:

val sliceDF = df.withColumn("languages", slice(col("languagesAtSchool"), 2, 3))
  .drop("languagesAtSchool")
sliceDF.printSchema()
sliceDF.show(false)
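That snippet assumes an existing DataFrame df. Here is a minimal self-contained sketch; the sample rows and the local[*] master are assumptions made for illustration, and slice uses a 1-based start index:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, slice}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SliceExample")
  .getOrCreate()
import spark.implicits._

// Sample data: a name plus an array column of languages (illustrative values).
val df = Seq(
  ("James", Seq("Java", "Scala", "C++", "Python")),
  ("Anna",  Seq("PHP", "Scala", "Perl"))
).toDF("name", "languagesAtSchool")

// Keep 3 elements starting at (1-based) position 2.
val sliceDF = df.withColumn("languages", slice(col("languagesAtSchool"), 2, 3))
  .drop("languagesAtSchool")

sliceDF.show(false)
// James -> [Scala, C++, Python], Anna -> [Scala, Perl]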

How do I use the slice() function on Spark columns?

The column slice function takes a Column of ArrayType as its first argument, followed by the start index within the array and the number of elements to extract. Like all Spark SQL functions, slice() returns an org.apache.spark.sql.Column of ArrayType.
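For reference, the Scala signature in org.apache.spark.sql.functions (available since Spark 2.4) is:

def slice(x: Column, start: Int, length: Int): Column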

How do I use raw SQL in Spark?

Since Spark provides a way to execute raw SQL, let's write the same slicing example as a Spark SQL expression. To use raw SQL, you first need to create a temporary view with createOrReplaceTempView(). This creates a temporary view from the DataFrame, and that view is available for the lifetime of the current Spark session.
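A short sketch, reusing the df from the sketch above (the view name TAB is an arbitrary choice):

df.createOrReplaceTempView("TAB")
spark.sql("SELECT name, slice(languagesAtSchool, 2, 3) AS languages FROM TAB").show(false)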


1 Answer

The most concise way I can think of is to flip your mental model and put indices first:

indices map A

I would also suggest using lift, which returns an Option and gives None instead of throwing an exception for an out-of-range index:

indices map A.lift
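
For example, given the A and indices from the question, a Scala 2.x REPL session shows the difference (this also explains the earlier foreach attempt: foreach returns Unit and discards the looked-up values, whereas map collects them):

scala> indices map A
res0: Array[Int] = Array(13, 12, 10)

scala> indices map A.lift
res1: Array[Option[Int]] = Array(Some(13), Some(12), Some(10))

scala> Array(3, 99) map A.lift
res2: Array[Option[Int]] = Array(Some(13), None)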
Justin Pihony answered Sep 20 '22