
Is there any better way to convert Array<int> to Array<String> in pyspark

A very huge DataFrame with schema:

root
 |-- id: string (nullable = true)
 |-- ext: array (nullable = true)
 |    |-- element: integer (containsNull = true)

So far I have tried to explode the data, then collect_list:

select
  id,
  collect_list(cast(item as string))
from default.dual
lateral view explode(ext) t as item
group by
  id

But this approach is too expensive.

Zhang Tong asked Jan 05 '18


1 Answer

You can simply cast the ext column to a string array:

df = source.withColumn("ext", source.ext.cast("array<string>"))
df.printSchema()
df.show()
Silvio answered Oct 10 '22