Select array element from Spark DataFrame's split method in the same call?

I'm splitting an HTTP request to look at the elements, and I was wondering if there was a way to specify the element I'd like to look at in the same call without having to do another operation.

For example:

from pyspark.sql import functions as fn

df.select(fn.split(df.http_request, '/').alias('http'))

gives me a new Dataframe with rows of arrays like this:

+--------------------+
|                http|
+--------------------+
|[, courses, 26420...|

I want the item in index 1 (courses) without having to then do another select statement to specify df.select(df.http[1]) or whatever. Is this possible?
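For reference, the two-step version I'm trying to avoid would look roughly like this (the intermediate name split_df is just for illustration):

from pyspark.sql import functions as fn

# First select: split the request into an array column.
split_df = df.select(fn.split(df.http_request, '/').alias('http'))

# Second select: pull out the element I actually want.
split_df.select(split_df.http[1])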

asked Jun 07 '16 by flybonzai
1 Answer

Use getItem. I'd say don't use a Python UDF just to make the code look prettier - it's much slower than the native DataFrame functions, because the data has to move between Python and the JVM.

from pyspark.sql import functions as F

# getItem(1) extracts index 1 from the split array; alias names the resulting column.
df.select(F.split(df.http_request, '/').getItem(1).alias('http'))
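For a self-contained sketch of the same idea (the session setup and sample rows below are assumptions for illustration, not from the original post):

from pyspark.sql import SparkSession, functions as F

# Hypothetical setup and sample data, just to make the example runnable.
spark = SparkSession.builder.appName("split-getitem-demo").getOrCreate()
df = spark.createDataFrame(
    [("/courses/264205/lectures",), ("/courses/118712/quizzes",)],
    ["http_request"],
)

# Split the path and pull out index 1 ("courses") in a single select.
result = df.select(F.split(df.http_request, "/").getItem(1).alias("http"))
result.show()
# +-------+
# |   http|
# +-------+
# |courses|
# |courses|
# +-------+

Square-bracket indexing on the column, F.split(df.http_request, "/")[1], is equivalent to getItem(1) here.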
answered Nov 14 '22 by max