I've got a DF with columns of different time cycles (1/6, 3/6, 6/6 etc.) and would like to "explode" all the columns to create a new DF in which each row is a 1/6 cycle.
from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, arrays_zip, col
spark = SparkSession.builder \
.appName('DataFrame') \
.master('local[*]') \
.getOrCreate()
df = spark.createDataFrame([Row(a=1, b=[1, 2, 3, 4, 5, 6], c=[11, 22, 33], d=['foo'])])
+---+------------------+------------+-----+
|  a|                 b|           c|    d|
+---+------------------+------------+-----+
|  1|[1, 2, 3, 4, 5, 6]|[11, 22, 33]|[foo]|
+---+------------------+------------+-----+
I'm doing the explode:
df2 = (df.withColumn("tmp", arrays_zip("b", "c", "d"))
.withColumn("tmp", explode("tmp"))
.select("a", col("tmp.b"), col("tmp.c"), "d"))
But the output is not what I want:
+---+---+----+-----+
|  a|  b|   c|    d|
+---+---+----+-----+
| 1| 1| 11|[foo]|
| 1| 2| 22|[foo]|
| 1| 3| 33|[foo]|
| 1| 4|null|[foo]|
| 1| 5|null|[foo]|
| 1| 6|null|[foo]|
+---+---+----+-----+
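The padding behaviour above comes from arrays_zip: when the arrays have different lengths, the shorter ones are extended with null at the end, so c becomes [11, 22, 33, null, null, null] before the explode. A plain-Python sketch with itertools.zip_longest (not Spark code, just an analogy) shows the same effect:

```python
from itertools import zip_longest

b = [1, 2, 3, 4, 5, 6]
c = [11, 22, 33]
d = ['foo']

# Like arrays_zip, zip_longest pads the shorter sequences at the END,
# which is why the values stay bunched at the top of the exploded rows.
rows = list(zip_longest(b, c, d, fillvalue=None))
for row in rows:
    print(row)
# (1, 11, 'foo'), (2, 22, None), (3, 33, None), (4, None, None), ...
```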
I would want it to look like this:
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
| 1| 1| 11|foo|
| | 2| | |
| | 3| 22| |
| | 4| | |
| | 5| 33| |
| | 6| | |
+---+---+---+---+
I am new to Spark and have run into complicated topics right from the start! :)
Update 2019-07-15: Maybe someone has a solution without usage of UDFs? -> answered by @jxc
Update 2019-07-17: Maybe someone has a solution for arranging the null <-> value sequences in a more complicated order? For example, in column c:
- Null, 11, Null, 22, Null, 33

Or an even more complex case, as we want for column d: the first value Null, the next foo, then Null, Null, Null:
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
| 1| 1| | |
| | 2| 11|foo|
| | 3| | |
| | 4| 22| |
| | 5| | |
| | 6| 33| |
+---+---+---+---+
Here is one way without using udf:
UPDATE on 2019/07/17: adjusted the SQL stmt and added N=6 as a parameter to the SQL.
UPDATE on 2019/07/16: removed the temporary column t, replaced with a constant array(0,1,2,3,4,5) in the transform function. In this case, we can operate on the values of the array elements directly instead of on their indexes.
UPDATE: I removed the original method, which used String functions, converted all array element data types into String, and was less efficient. The Spark SQL higher-order functions available in Spark 2.4+ should be better than the original method.
from pyspark.sql import functions as F, Row
df = spark.createDataFrame([ Row(a=1, b=[1, 2, 3, 4, 5, 6], c=['11', '22', '33'], d=['foo'], e=[111,222]) ])
>>> df.show()
+---+------------------+------------+-----+----------+
| a| b| c| d| e|
+---+------------------+------------+-----+----------+
| 1|[1, 2, 3, 4, 5, 6]|[11, 22, 33]|[foo]|[111, 222]|
+---+------------------+------------+-----+----------+
# columns you want to do array-explode
cols = df.columns
# number of array elements to set
N = 6
Use the Spark SQL higher-order function transform() as follows:
Create the following Spark SQL template, where {0} will be replaced by the column name and {1} will be replaced by N:
stmt = '''
CASE
WHEN '{0}' in ('d') THEN
transform(sequence(0,{1}-1), x -> IF(x == 1, `{0}`[0], NULL))
WHEN size(`{0}`) <= {1}/2 AND size(`{0}`) > 1 THEN
transform(sequence(0,{1}-1), x -> IF(((x+1)*size(`{0}`))%{1} == 0, `{0}`[int((x-1)*size(`{0}`)/{1})], NULL))
ELSE `{0}`
END AS `{0}`
'''
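To see exactly what selectExpr() receives, you can render the template for one column. A minimal sketch using plain Python string formatting (the stmt template is repeated here so the snippet is self-contained):

```python
# The SQL template from the answer; {0} is the column name, {1} is N.
stmt = '''
CASE
  WHEN '{0}' in ('d') THEN
    transform(sequence(0,{1}-1), x -> IF(x == 1, `{0}`[0], NULL))
  WHEN size(`{0}`) <= {1}/2 AND size(`{0}`) > 1 THEN
    transform(sequence(0,{1}-1), x -> IF(((x+1)*size(`{0}`))%{1} == 0, `{0}`[int((x-1)*size(`{0}`)/{1})], NULL))
  ELSE `{0}`
END AS `{0}`
'''

# Render it for column c with N=6:
rendered = stmt.format('c', 6)
print(rendered)
```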
Note: the array transformation is only defined when the array contains more than one element (unless handled in a separate WHEN clause) and at most N/2 elements (in this example, 1 < size <= 3). Arrays of any other size are kept as-is.
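The placement rule in the second WHEN branch can be checked with ordinary integer arithmetic. A plain-Python sketch (my own illustration of the index math, not Spark code):

```python
def place(values, n):
    """Spread `values` (size s) over n slots: element k lands where the
    Spark condition ((x+1)*s) % n == 0 holds, i.e. at slot (k+1)*n//s - 1."""
    s = len(values)
    return [values[(x - 1) * s // n] if ((x + 1) * s) % n == 0 else None
            for x in range(n)]

print(place([11, 22, 33], 6))  # [None, 11, None, 22, None, 33]
print(place([111, 222], 6))    # [None, None, 111, None, None, 222]
```

This reproduces the spacing requested in the 2019-07-17 update: values drift toward the end of the N slots instead of bunching at the start.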
Run the above SQL with selectExpr() for all required columns:
df1 = df.withColumn('a', F.array('a')) \
.selectExpr(*[ stmt.format(c,N) for c in cols ])
>>> df1.show()
+---+------------------+----------------+-----------+---------------+
| a| b| c| d| e|
+---+------------------+----------------+-----------+---------------+
|[1]|[1, 2, 3, 4, 5, 6]|[, 11,, 22,, 33]|[, foo,,,,]|[,, 111,,, 222]|
+---+------------------+----------------+-----------+---------------+
Run arrays_zip and explode:
df_new = df1.withColumn('vals', F.explode(F.arrays_zip(*cols))) \
.select('vals.*') \
.fillna('', subset=cols)
>>> df_new.show()
+----+---+---+---+----+
| a| b| c| d| e|
+----+---+---+---+----+
| 1| 1| | |null|
|null| 2| 11|foo|null|
|null| 3| | | 111|
|null| 4| 22| |null|
|null| 5| | |null|
|null| 6| 33| | 222|
+----+---+---+---+----+
Note: fillna('', subset=cols) only changes String-type columns, which is why the integer column e still shows null.
The same processing written as a single chain:
df_new = df.withColumn('a', F.array('a')) \
.selectExpr(*[ stmt.format(c,N) for c in cols ]) \
.withColumn('vals', F.explode(F.arrays_zip(*cols))) \
.select('vals.*') \
.fillna('', subset=cols)
The transform function (listed below; this reflects an older revision of the requirements):
transform(sequence(0,5), x -> IF((x*size({0}))%6 == 0, {0}[int(x*size({0})/6)], NULL))
As mentioned in the post, {0} will be replaced with the column name. Here we use column c, which contains 3 elements, as an example:
sequence(0,5) creates a constant array array(0,1,2,3,4,5) with 6 elements, and the rest sets up a lambda function with one argument x holding the value of each element. The condition applied is (x*size(c))%6 == 0, where size(c)=3; if this condition is true, it returns c[int(x*size(c)/6)], otherwise it returns NULL. So for x from 0 to 5, we have:
(0*3)%6 == 0  true  --> c[int(0*3/6)] = c[0]
(1*3)%6 == 0  false --> NULL
(2*3)%6 == 0  true  --> c[int(2*3/6)] = c[1]
(3*3)%6 == 0  false --> NULL
(4*3)%6 == 0  true  --> c[int(4*3/6)] = c[2]
(5*3)%6 == 0  false --> NULL
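The table above can be reproduced with a few lines of plain Python (an illustration of the index math only, not Spark code):

```python
c = [11, 22, 33]

# Old-revision rule: keep c[int(x*size/6)] where (x*size) % 6 == 0, else NULL.
result = [c[x * len(c) // 6] if (x * len(c)) % 6 == 0 else None
          for x in range(6)]
print(result)  # [11, None, 22, None, 33, None]
```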
The same reasoning applies to column e, which contains a 2-element array.