I have a data frame of the following type:
col1|col2|col3|col4
xxxx|yyyy|zzzz|[1111],[2222]
I want my output to be of the following type:
col1|col2|col3|col4|col5
xxxx|yyyy|zzzz|1111|2222
My col4 is an array, and I want to convert its elements into separate columns. What needs to be done?
I saw many answers using flatMap, but they increase the number of rows. I just want the tuple to be put into additional columns in the same row.
The following is my actual schema:
root
|-- PRIVATE_IP: string (nullable = true)
|-- PRIVATE_PORT: integer (nullable = true)
|-- DESTINATION_IP: string (nullable = true)
|-- DESTINATION_PORT: integer (nullable = true)
|-- collect_set(TIMESTAMP): array (nullable = true)
| |-- element: string (containsNull = true)
Also, could someone please help me with an explanation of both DataFrames and RDDs?
Create sample data:
from pyspark.sql import Row
x = [Row(col1="xx", col2="yy", col3="zz", col4=[123, 234])]
rdd = sc.parallelize(x)
df = spark.createDataFrame(rdd)
df.show()
#+----+----+----+----------+
#|col1|col2|col3| col4|
#+----+----+----+----------+
#| xx| yy| zz|[123, 234]|
#+----+----+----+----------+
Use getItem to extract the elements from the array column, as shown below. In your actual case, replace col4 with collect_set(TIMESTAMP):
df = df.withColumn("col5", df["col4"].getItem(1)).withColumn("col4", df["col4"].getItem(0))
df.show()
#+----+----+----+----+----+
#|col1|col2|col3|col4|col5|
#+----+----+----+----+----+
#| xx| yy| zz| 123| 234|
#+----+----+----+----+----+
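Applied to your actual schema, the column name collect_set(TIMESTAMP) contains parentheses, so renaming it first is the easiest way to reference it. A minimal sketch, assuming your DataFrame is called df, each set holds at least two timestamps, and the names ts, TIMESTAMP_1 and TIMESTAMP_2 are just placeholders:
# rename the aggregated column to avoid escaping issues with "(" and ")"
df = df.withColumnRenamed("collect_set(TIMESTAMP)", "ts")
# pull the first two elements out into their own columns, then drop the array
df = df.withColumn("TIMESTAMP_1", df["ts"].getItem(0)) \
       .withColumn("TIMESTAMP_2", df["ts"].getItem(1)) \
       .drop("ts")
Keep in mind that collect_set does not guarantee element order, so if order matters you may want to sort the array first (e.g. with F.sort_array) before extracting elements.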
You have 4 options to extract a value from the array column:
df = spark.createDataFrame([[1, [10, 20, 30, 40]]], ['A', 'B'])
df.show()
+---+----------------+
| A| B|
+---+----------------+
| 1|[10, 20, 30, 40]|
+---+----------------+
from pyspark.sql import functions as F
df.select(
"A",
df.B[0].alias("B0"), # dot notation and index
F.col("B")[1].alias("B1"), # function col and index
df.B.getItem(2).alias("B2"), # dot notation and method getItem
F.col("B").getItem(3).alias("B3"), # function col and method getItem
).show()
+---+---+---+---+---+
| A| B0| B1| B2| B3|
+---+---+---+---+---+
| 1| 10| 20| 30| 40|
+---+---+---+---+---+
In case you need to extract many elements, use a list comprehension:
df.select(
'A', *[F.col('B')[i].alias(f'B{i}') for i in range(4)]
).show()
+---+---+---+---+---+
| A| B0| B1| B2| B3|
+---+---+---+---+---+
| 1| 10| 20| 30| 40|
+---+---+---+---+---+
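If the array length is not known in advance, one option is to compute the longest array first and build the column list from that. A rough sketch, assuming the same df with array column B; the extra pass over the data to find the length is the trade-off:
from pyspark.sql import functions as F

# find the length of the longest array in column B
max_len = df.select(F.max(F.size("B"))).first()[0]

# expand to that many columns; shorter arrays yield null for the
# missing positions (with default, non-ANSI settings)
df.select(
    "A", *[F.col("B")[i].alias(f"B{i}") for i in range(max_len)]
).show()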