 

Handle string to array conversion in a PySpark DataFrame

I have a CSV file which, when read into a Spark DataFrame, shows the following in printSchema():

-- list_values: string (nullable = true)

The values in the list_values column look something like this:

[[[167, 109, 80, ...]]]

Is it possible to convert this to array type instead of string?

I tried splitting it, using code I found online for similar problems:

from pyspark.sql.functions import split, col
df_1 = df.select('list_values', split(col("list_values"), r",\s*").alias("list_values"))

but when I run the above code, the resulting array is missing values from the original.

The output of the above code is:

[, 109, 80, 69, 5...

which differs from the original array, i.e. the first value (167) is missing:

[[[167, 109, 80, ...]]]

Since I am new to Spark, I don't have much knowledge of how this is done. (In plain Python I could have used ast.literal_eval, but Spark has no provision for this.)
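
For reference, this is the plain-Python conversion I have in mind (not Spark code, just an illustration of the goal):

import ast

s = "[[[167, 109, 80]]]"    # the string stored in the column
ast.literal_eval(s)         # -> [[[167, 109, 80]]] as nested Python lists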

So, to restate the question:

How can I convert/cast an array stored as a string to an actual array, i.e.

'[]' to [] conversion

kunal


1 Answer

Suppose your DataFrame was the following:

df.show()
#+----+------------------+
#|col1|              col2|
#+----+------------------+
#|   a|[[[167, 109, 80]]]|
#+----+------------------+

df.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
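
(For reproducibility, a minimal way to build such an example DataFrame might be the following, assuming an active SparkSession bound to the name spark:)

df = spark.createDataFrame([("a", "[[[167, 109, 80]]]")], ["col1", "col2"])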

You could use pyspark.sql.functions.regexp_replace to remove the leading and trailing square brackets. Once that's done, you can split the resulting string on ", ":

from pyspark.sql.functions import split, regexp_replace

df2 = df.withColumn(
    "col3",
    # strip the leading "[[[" and trailing "]]]", then split on ", "
    split(regexp_replace("col2", r"(^\[\[\[)|(\]\]\]$)", ""), ", ")
)
df2.show()

#+----+------------------+--------------+
#|col1|              col2|          col3|
#+----+------------------+--------------+
#|   a|[[[167, 109, 80]]]|[167, 109, 80]|
#+----+------------------+--------------+

df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# |    |-- element: string (containsNull = true)

If you wanted the column as an array of integers, you could use cast:

from pyspark.sql.functions import col
df2 = df2.withColumn("col3", col("col3").cast("array<int>"))
df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# |    |-- element: integer (containsNull = true)
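
As a quick sanity check (assuming the df2 built above), the values now behave as integers, and individual elements can be accessed by index:

df2.select("col3", col("col3")[0].alias("first_value")).show()
#+--------------+-----------+
#|          col3|first_value|
#+--------------+-----------+
#|[167, 109, 80]|        167|
#+--------------+-----------+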

pault