Combine PySpark DataFrame ArrayType fields into single ArrayType field

I have a PySpark DataFrame with 2 ArrayType fields:

>>> df
DataFrame[id: string, tokens: array<string>, bigrams: array<string>]
>>> df.take(1)
[Row(id='ID1', tokens=['one', 'two', 'two'], bigrams=['one two', 'two two'])]

I would like to combine them into a single ArrayType field:

>>> df2
DataFrame[id: string, tokens_bigrams: array<string>]
>>> df2.take(1)
[Row(id='ID1', tokens_bigrams=['one', 'two', 'two', 'one two', 'two two'])]

The syntax that works with strings does not seem to work here:

df2 = df.withColumn('tokens_bigrams', df.tokens + df.bigrams) 

Thanks!

zemekeneng asked May 17 '16 18:05

People also ask

How do I concatenate multiple columns in PySpark?

The concat() function of PySpark SQL is used to concatenate multiple DataFrame columns into a single column. It can also be used to concatenate columns of string, binary, and compatible array types.
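As a minimal sketch (the column names first and last are hypothetical), concat() joins string columns value by value:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data: two string columns
people = spark.createDataFrame([("John", "Doe")], ("first", "last"))

people.select(concat(col("first"), col("last")).alias("full")).show()
# +-------+
# |   full|
# +-------+
# |JohnDoe|
# +-------+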

How does PySpark define ArrayType?

You can create an instance of an ArrayType using the ArrayType() class. It takes an element type and an optional containsNull argument that specifies whether elements may be null; it defaults to True. The element type should be a PySpark type that extends DataType.
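For instance, a minimal sketch of constructing the array-of-strings type used in the answer below:

from pyspark.sql.types import ArrayType, StringType

# An array of strings; containsNull defaults to True,
# so null elements are allowed
string_array = ArrayType(StringType())

# The same element type with null elements explicitly disallowed
strict_string_array = ArrayType(StringType(), containsNull=False)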

How do you combine columns in PySpark?

Concatenating columns in PySpark, whether two or many, is accomplished using the concat() function, as sketched below.
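A minimal sketch (the column names x, y, z are hypothetical); the related concat_ws() variant inserts a separator between string values:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws

spark = SparkSession.builder.getOrCreate()

letters = spark.createDataFrame([("a", "b", "c")], ("x", "y", "z"))

# concat() accepts any number of columns
letters.select(concat("x", "y", "z").alias("joined")).show()
# +------+
# |joined|
# +------+
# |   abc|
# +------+

# concat_ws() additionally takes a separator as its first argument
letters.select(concat_ws("-", "x", "y", "z").alias("joined")).show()
# +------+
# |joined|
# +------+
# | a-b-c|
# +------+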


1 Answer

Spark >= 2.4

You can use the concat function (SPARK-23736):

from pyspark.sql.functions import col, concat

df.select(concat(col("tokens"), col("tokens_bigrams"))).show(truncate=False)

# +---------------------------------+
# |concat(tokens, tokens_bigrams)   |
# +---------------------------------+
# |[one, two, two, one two, two two]|
# |null                             |
# +---------------------------------+

To keep data when one of the values is NULL, you can coalesce with array:

from pyspark.sql.functions import array, coalesce

df.select(concat(
    coalesce(col("tokens"), array()),
    coalesce(col("tokens_bigrams"), array())
)).show(truncate=False)

# +--------------------------------------------------------------------+
# |concat(coalesce(tokens, array()), coalesce(tokens_bigrams, array()))|
# +--------------------------------------------------------------------+
# |[one, two, two, one two, two two]                                   |
# |[three]                                                             |
# +--------------------------------------------------------------------+
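The same null-safe concatenation can be written as a single SQL expression via selectExpr, which also makes it easy to name the result tokens_bigrams as in the question; a sketch assuming the same df as above:

df.selectExpr(
    "concat(coalesce(tokens, array()), coalesce(tokens_bigrams, array())) AS tokens_bigrams"
).show(truncate=False)

# +---------------------------------+
# |tokens_bigrams                   |
# +---------------------------------+
# |[one, two, two, one two, two two]|
# |[three]                          |
# +---------------------------------+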

Spark < 2.4

Unfortunately, to concatenate array columns in the general case you'll need a UDF, for example like this:

from itertools import chain
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType


def concat(type):
    # Chain all array arguments together, treating NULL (None) as an empty list
    def concat_(*args):
        return list(chain.from_iterable((arg if arg else [] for arg in args)))
    return udf(concat_, ArrayType(type))

which can be used as:

df = spark.createDataFrame(
    [(["one", "two", "two"], ["one two", "two two"]), (["three"], None)],
    ("tokens", "tokens_bigrams")
)

concat_string_arrays = concat(StringType())
df.select(concat_string_arrays("tokens", "tokens_bigrams")).show(truncate=False)

# +---------------------------------+
# |concat_(tokens, tokens_bigrams)  |
# +---------------------------------+
# |[one, two, two, one two, two two]|
# |[three]                          |
# +---------------------------------+
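To reproduce the df2 from the question, where the input columns are named tokens and bigrams, the same UDF can be combined with alias; a sketch assuming the asker's original df:

# Assumes df has the question's schema: id, tokens, bigrams
df2 = df.select(
    "id",
    concat_string_arrays("tokens", "bigrams").alias("tokens_bigrams")
)
# DataFrame[id: string, tokens_bigrams: array<string>]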
zero323 answered Oct 28 '22 05:10