
Get the first element of an array in PySpark

Tags:

pyspark

I want to add two new columns holding the first and second values of the Services array, but I'm getting this error:

Field name should be String Literal, but it's 0;

production_target_datasource_df.withColumn("newcol",production_target_datasource_df["Services"].getItem(0))
    +------------------+--------------------+
    |         cid      |            Services|
    +------------------+--------------------+
    |845124826013182686|     [112931, serv1]|
    |845124826013182686|     [146936, serv1]|
    |845124826013182686|      [32718, serv2]|
    |845124826013182686|      [28839, serv2]|
    |845124826013182686|       [8710, serv2]|
    |845124826013182686|    [2093140, serv3]|
xxxerneaxx asked Oct 27 '25 10:10


2 Answers

You don't have to use .getItem(0).

If Services is an array column, production_target_datasource_df["Services"][0] would be enough.

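# Constructing your table:
from pyspark.sql import Row, SparkSession

# Outside the pyspark shell, `sc` has to be created explicitly.
sc = SparkSession.builder.getOrCreate().sparkContext

df = sc.parallelize([
    Row(cid=1, Services=["2", "serv1"]),
    Row(cid=1, Services=["3", "serv1"]),
    Row(cid=1, Services=["4", "serv2"]),
]).toDF()
df.show()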
+---+----------+
|cid|  Services|
+---+----------+
|  1|[2, serv1]|
|  1|[3, serv1]|
|  1|[4, serv2]|
+---+----------+

# Adding the two columns:
new_df = df.withColumn("first_element", df.Services[0])
new_df = new_df.withColumn("second_element", df.Services[1])
new_df.show()

+---+----------+-------------+--------------+
|cid|  Services|first_element|second_element|
+---+----------+-------------+--------------+
|  1|[2, serv1]|            2|         serv1|
|  1|[3, serv1]|            3|         serv1|
|  1|[4, serv2]|            4|         serv2|
+---+----------+-------------+--------------+
Cena answered Oct 29 '25 09:10


As the error says, you need to pass a string, not 0. So which string should you pass?

If you follow @pault's advice and call printSchema, you will see which keys correspond to the values in Services.

The documentation of getItem can also help you figure this out.

Another way to find out what to pass is simply to pass any string. For example, you could type:

production_target_datasource_df.withColumn("newcol",production_target_datasource_df["Services"].getItem('0'))

and the error message in the logs will tell you which keys were expected.

Hope this helps ;)

MichaelU answered Oct 29 '25 07:10


