How do I get the last item from a list using pyspark?

Why does the column 1st_from_end contain null in the following example:

from pyspark.sql.functions import split

df = sqlContext.createDataFrame([('a b c d',)], ['s'])
df.select(split(df.s, ' ')[0].alias('0th'),
          split(df.s, ' ')[3].alias('3rd'),
          split(df.s, ' ')[-1].alias('1st_from_end')
         ).show()

+---+---+------------+
|0th|3rd|1st_from_end|
+---+---+------------+
|  a|  d|        null|
+---+---+------------+
I thought using [-1] was a Pythonic way to get the last item in a list. Why doesn't it work in PySpark?

asked Nov 07 '16 by jamiet

People also ask

How do you find the last element of an array in PySpark?

element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array.
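For example, a minimal sketch using the same data as the question (assuming Spark 2.4+ and an active SparkSession named spark):

from pyspark.sql.functions import element_at, split

df = spark.createDataFrame([('a b c d',)], ['s'])
# element_at accepts a negative index, counting back from the end of the array
df.select(element_at(split(df.s, ' '), -1).alias('last')).show()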

How do you get the last row of a PySpark DataFrame?

Use the tail() action to get the last N rows of a DataFrame; it returns a list of Row objects in PySpark and an Array[Row] in Scala.
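A short sketch, assuming Spark 3.0+ (where DataFrame.tail was added) and a SparkSession named spark:

df = spark.createDataFrame([(i,) for i in range(5)], ['n'])
# tail(num) collects the last num rows to the driver as a list of Row
print(df.tail(2))   # [Row(n=3), Row(n=4)]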

How do you slice in PySpark?

In this method, we first make a PySpark DataFrame using createDataFrame(), then use the randomSplit() function to get two slices of the DataFrame, specifying the fractions of rows that will be present in each slice.
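A minimal sketch (the weights are normalized fractions, so the slice sizes are approximate rather than exact):

df = spark.createDataFrame([(i,) for i in range(100)], ['n'])
# split into two DataFrames holding roughly 80% and 20% of the rows
slice1, slice2 = df.randomSplit([0.8, 0.2], seed=42)
print(slice1.count(), slice2.count())   # roughly 80 and 20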

What does take () do in PySpark?

take(num) returns the first num elements of the RDD. It works by first scanning one partition, then using the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
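For instance, a quick sketch against an RDD built from the SparkSession's underlying SparkContext:

rdd = spark.sparkContext.parallelize(range(100), 4)
# take(5) scans partitions incrementally until it has 5 elements
print(rdd.take(5))   # [0, 1, 2, 3, 4]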


2 Answers

For Spark 2.4+, use pyspark.sql.functions.element_at; from the documentation:

element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array.

from pyspark.sql.functions import element_at, split, col

df = spark.createDataFrame([('a b c d',)], ['s',])

df.withColumn('arr', split(df.s, ' ')) \
  .select( col('arr')[0].alias('0th')
         , col('arr')[3].alias('3rd')
         , element_at(col('arr'), -1).alias('1st_from_end')
     ).show()

+---+---+------------+
|0th|3rd|1st_from_end|
+---+---+------------+
|  a|  d|           d|
+---+---+------------+
answered Oct 18 '22 by jxc


If you're using Spark >= 2.4.0, see jxc's answer above.

In Spark < 2.4.0, the DataFrame API didn't support negative indexing on arrays, but you could write your own UDF or use the built-in size() function, for example:

>>> from pyspark.sql.functions import size
>>> splitted = df.select(split(df.s, ' ').alias('arr'))
>>> splitted.select(splitted.arr[size(splitted.arr)-1]).show()
+--------------------+
|arr[(size(arr) - 1)]|
+--------------------+
|                   d|
+--------------------+
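And a minimal sketch of the UDF route mentioned above, reusing the splitted DataFrame from the previous snippet (plain Python [-1] indexing works inside a UDF even though the column API rejects it):

>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType
>>> # wrap native list indexing in a UDF, returning None for empty arrays
>>> last_elem = udf(lambda arr: arr[-1] if arr else None, StringType())
>>> splitted.select(last_elem(splitted.arr).alias('1st_from_end')).show()
+------------+
|1st_from_end|
+------------+
|           d|
+------------+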
answered Oct 18 '22 by Mariusz