I have a PySpark dataframe where some of the columns contain arrays of strings (and one column contains a nested array). As a result, I cannot write the dataframe to a CSV.
Here is an example of the dataframe that I am dealing with -
+-------+---------------+--------+
|ID     |emailed        |clicked |
+-------+---------------+--------+
|9000316|[KBR, NRT, AOR]|[[AOR]] |
|9000854|[KBR, NRT, LAX]|Null    |
|9001996|[KBR, JFK]     |[[JFK]] |
+-------+---------------+--------+
I would like to get the following structure, to be saved as a csv.
+-------+---------------+--------+
|ID     |emailed        |clicked |
+-------+---------------+--------+
|9000316|KBR, NRT, AOR  |AOR     |
|9000854|KBR, NRT, LAX  |Null    |
|9001996|KBR, JFK       |JFK     |
+-------+---------------+--------+
I am very new to pyspark. Your help is greatly appreciated. Thank you!
Can you try this way? You will have to import the function first:
from pyspark.sql.functions import concat_ws
df.select(concat_ws(',', df.emailed).alias('string_form')).collect()
Note that emailed is already an array of strings, so split is not needed; concat_ws joins the array elements directly.
Let me know if that helps.
-----Update-----
The code is explained in the link; I modified it a bit.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def getter(column):
    # Join the array elements with commas; pass nulls through unchanged.
    if column is None:
        return None
    col_new = ''
    for i, col in enumerate(column):
        if i == 0:
            col_new = col
        else:
            col_new = col_new + ',' + col
    return col_new

getterUDF = udf(getter, StringType())
df.select(getterUDF('emailed').alias('emailed'))
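Since the joining logic is plain Python, you can sanity-check it without a Spark session. This is a standalone copy of the same idea, with a None guard added so null arrays survive:

```python
def getter(column):
    # Build "a,b,c" from ["a", "b", "c"]; None stays None.
    if column is None:
        return None
    col_new = ''
    for i, col in enumerate(column):
        if i == 0:
            col_new = col
        else:
            col_new = col_new + ',' + col
    return col_new

print(getter(["KBR", "NRT", "AOR"]))  # KBR,NRT,AOR
```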
You can try this as well.