How do I get the last item from a list using pyspark?

Why does the column 1st_from_end contain null in the following example:

from pyspark.sql.functions import split

df = sqlContext.createDataFrame([('a b c d',)], ['s'])
df.select(split(df.s, ' ')[0].alias('0th'),
          split(df.s, ' ')[3].alias('3rd'),
          split(df.s, ' ')[-1].alias('1st_from_end')
         ).show()

+---+---+------------+
|0th|3rd|1st_from_end|
+---+---+------------+
|  a|  d|        null|
+---+---+------------+
I thought using [-1] was a Pythonic way to get the last item in a list. Why doesn't it work in PySpark?

asked Nov 07 '16 by jamiet

People also ask

How do you find the last element of an array in PySpark?

element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array.
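For example, a minimal sketch using the same data as the question (assuming Spark 2.4+ and an active SparkSession named spark):

from pyspark.sql.functions import element_at, split

df = spark.createDataFrame([('a b c d',)], ['s'])
# element_at accepts a negative index, counting back from the end of the array
df.select(element_at(split(df.s, ' '), -1).alias('last')).show()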

How do you get the last row of a PySpark DataFrame?

Use the tail() action to get the last N rows of a DataFrame; it returns a list of Row objects in PySpark and an Array[Row] in Scala.
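A short sketch, assuming Spark 3.0+ (where DataFrame.tail was added) and a SparkSession named spark:

df = spark.createDataFrame([(i,) for i in range(5)], ['n'])
# tail(num) collects the last num rows to the driver as a list of Row
print(df.tail(2))   # [Row(n=3), Row(n=4)]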

How do you slice in PySpark?

In this method, we first make a PySpark DataFrame using createDataFrame(), then use the randomSplit() function to get two slices of the DataFrame, specifying the fractions of rows that will be present in each slice.
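A minimal sketch (the weights are normalized fractions, so the slice sizes are approximate rather than exact):

df = spark.createDataFrame([(i,) for i in range(100)], ['n'])
# split into two DataFrames holding roughly 80% and 20% of the rows
slice1, slice2 = df.randomSplit([0.8, 0.2], seed=42)
print(slice1.count(), slice2.count())   # roughly 80 and 20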

What does take () do in PySpark?

take(num) returns the first num elements of the RDD. It works by first scanning one partition, then using the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
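For instance, a quick sketch against an RDD built from the SparkSession's underlying SparkContext:

rdd = spark.sparkContext.parallelize(range(100), 4)
# take(5) scans partitions incrementally until it has 5 elements
print(rdd.take(5))   # [0, 1, 2, 3, 4]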


2 Answers

For Spark 2.4+, use pyspark.sql.functions.element_at; from the documentation:

element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array.

from pyspark.sql.functions import element_at, split, col

df = spark.createDataFrame([('a b c d',)], ['s',])

df.withColumn('arr', split(df.s, ' ')) \
  .select( col('arr')[0].alias('0th')
         , col('arr')[3].alias('3rd')
         , element_at(col('arr'), -1).alias('1st_from_end')
     ).show()

+---+---+------------+
|0th|3rd|1st_from_end|
+---+---+------------+
|  a|  d|           d|
+---+---+------------+
answered Oct 18 '22 by jxc


If you're using Spark >= 2.4.0, see jxc's answer above.

In Spark < 2.4.0, the DataFrame API didn't support negative indexing on arrays, but you could write your own UDF or use the built-in size() function, for example:

>>> from pyspark.sql.functions import size
>>> splitted = df.select(split(df.s, ' ').alias('arr'))
>>> splitted.select(splitted.arr[size(splitted.arr)-1]).show()
+--------------------+
|arr[(size(arr) - 1)]|
+--------------------+
|                   d|
+--------------------+
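And a minimal sketch of the UDF route mentioned above, reusing the splitted DataFrame from the previous snippet (plain Python [-1] indexing works inside a UDF even though the column API rejects it):

>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType
>>> # wrap native list indexing in a UDF, returning None for empty arrays
>>> last_elem = udf(lambda arr: arr[-1] if arr else None, StringType())
>>> splitted.select(last_elem(splitted.arr).alias('1st_from_end')).show()
+------------+
|1st_from_end|
+------------+
|           d|
+------------+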
answered Oct 18 '22 by Mariusz