Why does column 1st_from_end
contain null:
from pyspark.sql.functions import split
df = sqlContext.createDataFrame([('a b c d',)], ['s',])
df.select( split(df.s, ' ')[0].alias('0th'),
split(df.s, ' ')[3].alias('3rd'),
split(df.s, ' ')[-1].alias('1st_from_end')
).show()
I thought using [-1]
was a pythonic way to get the last item in a list. How come it doesn't work in pyspark?
element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array.
Use tail() action to get the Last N rows from a DataFrame, this returns a list of class Row for PySpark and Array[Row] for Spark with Scala.
In this method, we are first going to make a PySpark DataFrame using createDataFrame(). We will then use randomSplit() function to get two slices of the DataFrame while specifying the fractions of rows that will be present in both slices.
Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
For Spark 2.4+, use pyspark.sql.functions.element_at, see below from the documentation:
element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array.
from pyspark.sql.functions import element_at, split, col
df = spark.createDataFrame([('a b c d',)], ['s',])
df.withColumn('arr', split(df.s, ' ')) \
.select( col('arr')[0].alias('0th')
, col('arr')[3].alias('3rd')
, element_at(col('arr'), -1).alias('1st_from_end')
).show()
+---+---+------------+
|0th|3rd|1st_from_end|
+---+---+------------+
| a| d| d|
+---+---+------------+
If you're using Spark >= 2.4.0 see jxc's answer below.
In Spark < 2.4.0, dataframes API didn't support -1
indexing on arrays, but you could write your own UDF or use built-in size()
function, for example:
>>> from pyspark.sql.functions import size
>>> splitted = df.select(split(df.s, ' ').alias('arr'))
>>> splitted.select(splitted.arr[size(splitted.arr)-1]).show()
+--------------------+
|arr[(size(arr) - 1)]|
+--------------------+
| d|
+--------------------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With