 

Convert array<string> into string pyspark dataframe

I have a pyspark dataframe where some of the columns contain arrays of strings (and one column contains a nested array). As a result, I cannot write the dataframe to a csv.

Here is an example of the dataframe that I am dealing with -

    +-------+---------------+---------+
    |ID     |emailed        |clicked  |
    +-------+---------------+---------+
    |9000316|[KBR, NRT, AOR]|[[AOR]]  |
    |9000854|[KBR, NRT, LAX]|Null     |
    |9001996|[KBR, JFK]     |[[JFK]]  |
    +-------+---------------+---------+
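
For reference, here is a minimal sketch of how a dataframe with this shape could be built (the exact schema is my assumption; clicked is treated as a nested array of strings):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType, ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()

    # ID is a long, emailed is array<string>, clicked is a nested array<array<string>>
    schema = StructType([
        StructField("ID", LongType()),
        StructField("emailed", ArrayType(StringType())),
        StructField("clicked", ArrayType(ArrayType(StringType()))),
    ])

    df = spark.createDataFrame(
        [
            (9000316, ["KBR", "NRT", "AOR"], [["AOR"]]),
            (9000854, ["KBR", "NRT", "LAX"], None),
            (9001996, ["KBR", "JFK"], [["JFK"]]),
        ],
        schema,
    )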

I would like to get the following structure, to be saved as a csv.

    +-------+-------------+-------+
    |ID     |emailed      |clicked|
    +-------+-------------+-------+
    |9000316|KBR, NRT, AOR|AOR    |
    |9000854|KBR, NRT, LAX|Null   |
    |9001996|KBR, JFK     |JFK    |
    +-------+-------------+-------+

I am very new to pyspark. Your help is greatly appreciated. Thank you!

asked Sep 11 '17 by user42361

1 Answer

You can try it this way. You will have to import the function first:

from pyspark.sql.functions import concat_ws

# emailed is already an array of strings, so concat_ws can join it directly
df.select(concat_ws(',', df.emailed).alias('string_form')).collect()

Let me know if that helps.
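
Applied to the example dataframe, here is a sketch that converts both columns before saving (this assumes Spark 2.4+ for flatten, and "output_path" is just a placeholder; note that Null arrays may come out as empty strings rather than the literal Null shown above):

from pyspark.sql.functions import concat_ws, flatten

flat_df = df.select(
    "ID",
    # join the array of strings into a single comma-separated string
    concat_ws(",", "emailed").alias("emailed"),
    # clicked is a nested array, so flatten it first (Spark 2.4+)
    concat_ws(",", flatten("clicked")).alias("clicked"),
)

flat_df.write.csv("output_path", header=True)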

-----Update----

Here is the code explained in the link, which I modified a bit.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def getter(column):
    # Join the elements of an array of strings into one comma-separated string;
    # pass None through so Null rows stay Null.
    if column is None:
        return None
    col_new = ''
    for i, col in enumerate(column):
        if i == 0:
            col_new = col
        else:
            col_new = col_new + ',' + col
    return col_new

getterUDF = udf(getter, StringType())

df.select(getterUDF(df.emailed).alias('string_form'))  # replace emailed with your array column

You can try this as well.
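
If you go with the UDF, a sketch of saving the result ("output_path" is a placeholder; the nested clicked column would need to be flattened first, or the UDF adapted, since its elements are arrays rather than strings):

result = df.select(
    "ID",
    getterUDF(df.emailed).alias("emailed"),
)
result.write.csv("output_path", header=True)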

answered Dec 31 '22 by Manu Gupta