I have:

key  value
a    [1,2,3]
b    [2,3,4]

I want:

key  value1  value2  value3
a    1       2       3
b    2       3       4
It seems that in Scala I can write df.select($"value._1", $"value._2", $"value._3"), but that is not possible in Python. Is there a good way to do this?
If the data arrives as a delimiter-separated string rather than an array, pyspark.sql.functions provides a split() function that converts such a StringType column into an ArrayType column by splitting on a delimiter such as a space, comma, or pipe; its optional limit parameter defaults to -1. The resulting array column can then be expanded into multiple columns with select() or withColumn(). The pandas equivalents are str.split() with expand=True, or str.extract() with multiple capturing groups.
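For completeness, a minimal sketch of that split() pattern (hypothetical comma-separated data, assuming an existing SparkSession named spark; not taken from the answers below):

from pyspark.sql import functions as F

# Hypothetical comma-separated data.
df_str = spark.createDataFrame([("a", "1,2,3"), ("b", "2,3,4")], ["key", "value"])

# split() turns the StringType column into an ArrayType column...
df_arr = df_str.withColumn("value_arr", F.split("value", ","))

# ...which can then be expanded element by element (the values stay strings unless cast).
df_arr.select("key", *[df_arr["value_arr"][i].alias("value%d" % (i + 1)) for i in range(3)]).show()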
To split a column of doubles stored in DenseVector format, one first has to construct a UDF that converts the DenseVector to an array (Python list):
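The code for that UDF is not included above; a minimal sketch of such a conversion (assuming Spark ML's DenseVector and a SparkSession named spark) could look like:

from pyspark.ml.linalg import Vectors
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

# UDF that converts a DenseVector into a plain Python list of doubles.
to_array = F.udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))

# Hypothetical DataFrame with a vector-typed column.
df = spark.createDataFrame([("a", Vectors.dense([1.0, 2.0, 3.0])),
                            ("b", Vectors.dense([2.0, 3.0, 4.0]))], ["key", "value"])

df = df.withColumn("value_arr", to_array("value"))
df.select("key", *[df["value_arr"][i].alias("value%d" % (i + 1)) for i in range(3)]).show()

On Spark 3.0+, pyspark.ml.functions.vector_to_array offers the same conversion without a Python UDF.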
It depends on the type of your "list":
If it is of type ArrayType():
df = hc.createDataFrame(sc.parallelize([['a', [1,2,3]], ['b', [2,3,4]]]), ["key", "value"])
df.printSchema()
df.show()

root
 |-- key: string (nullable = true)
 |-- value: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-------+
|key|  value|
+---+-------+
|  a|[1,2,3]|
|  b|[2,3,4]|
+---+-------+

you can access the values like you would with Python, using []:

df.select("key", df.value[0], df.value[1], df.value[2]).show()

+---+--------+--------+--------+
|key|value[0]|value[1]|value[2]|
+---+--------+--------+--------+
|  a|       1|       2|       3|
|  b|       2|       3|       4|
+---+--------+--------+--------+
If it is of type StructType(): (maybe you built your dataframe by reading a JSON)
df2 = df.select("key", psf.struct( df.value[0].alias("value1"), df.value[1].alias("value2"), df.value[2].alias("value3") ).alias("value")) df2.printSchema() df2.show() root |-- key: string (nullable = true) |-- value: struct (nullable = false) | |-- value1: long (nullable = true) | |-- value2: long (nullable = true) | |-- value3: long (nullable = true) +---+-------+ |key| value| +---+-------+ | a|[1,2,3]| | b|[2,3,4]| +---+-------+
you can directly 'split' the column using *:
df2.select('key', 'value.*').show()

+---+------+------+------+
|key|value1|value2|value3|
+---+------+------+------+
|  a|     1|     2|     3|
|  b|     2|     3|     4|
+---+------+------+------+
I'd like to add the case of lists (arrays) of known size to pault's answer. If the column contains medium-sized arrays (or large ones), it is still possible to split them into columns.
from pyspark.sql.types import *          # Needed to define the DataFrame schema.
from pyspark.sql.functions import expr

# Define a schema to create a DataFrame with an array-typed column.
mySchema = StructType([StructField("V1", StringType(), True),
                       StructField("V2", ArrayType(IntegerType(), True))])

df = spark.createDataFrame([['A', [1, 2, 3, 4, 5, 6, 7]],
                            ['B', [8, 7, 6, 5, 4, 3, 2]]], schema=mySchema)

# Split the list into columns using expr() in a list comprehension.
arr_size = 7
df = df.select(['V1', 'V2'] + [expr('V2[' + str(x) + ']') for x in range(0, arr_size)])

# It is possible to define new column names.
new_colnames = ['V1', 'V2'] + ['val_' + str(i) for i in range(0, arr_size)]
df = df.toDF(*new_colnames)
The result is:
df.show(truncate=False)

+---+---------------------+-----+-----+-----+-----+-----+-----+-----+
|V1 |V2                   |val_0|val_1|val_2|val_3|val_4|val_5|val_6|
+---+---------------------+-----+-----+-----+-----+-----+-----+-----+
|A  |[1, 2, 3, 4, 5, 6, 7]|1    |2    |3    |4    |5    |6    |7    |
|B  |[8, 7, 6, 5, 4, 3, 2]|8    |7    |6    |5    |4    |3    |2    |
+---+---------------------+-----+-----+-----+-----+-----+-----+-----+
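If the array length is not known in advance, one possible variation (my assumption, not part of the original answer) is to derive arr_size from the data before splitting, starting again from the original two-column DataFrame:

from pyspark.sql import functions as F

# Assuming df is the original two-column DataFrame (V1, V2) from above,
# find the largest array length in V2 (this triggers a Spark job).
arr_size = df.select(F.max(F.size("V2")).alias("n")).first()["n"]

df = df.select(['V1', 'V2'] + [df["V2"][i].alias("val_" + str(i)) for i in range(arr_size)])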
I needed to unlist a 712-dimensional array into columns in order to write it to CSV. I used @MaFF's solution first for my problem, but that seemed to cause a lot of errors and additional computation time. I am not sure what was causing it, but I used a different method which reduced the computation time considerably (22 minutes compared to more than 4 hours)!
@MaFF's method:
length = len(dataset.head()["list_col"])
dataset = dataset.select(dataset.columns + [dataset["list_col"][k] for k in range(length)])
What I used:
dataset = dataset.rdd.map(lambda x: (*x, *x["list_col"])).toDF()
If someone has any idea what was causing this difference in computation time, please let me know! I suspect that in my case the bottleneck was calling head() to get the list length (which I would like to be adaptive), made worse because (i) my data pipeline was quite long and exhaustive, and (ii) I had to unlist multiple columns. Furthermore, caching the entire dataset was not an option.
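One caveat with this approach (my addition, not part of the original answer): toDF() without arguments names every column _1, _2, ..., so the original column names are lost. If the array length is known, you can pass explicit names instead:

# toDF() on an RDD of plain tuples produces generic column names (_1, _2, ...),
# so pass a list of names; "list_col_%d" is a hypothetical naming scheme.
n_elements = 712                                   # the known array length in this case
new_names = dataset.columns + ["list_col_%d" % i for i in range(n_elements)]
dataset = dataset.rdd.map(lambda x: (*x, *x["list_col"])).toDF(new_names)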
For ArrayType data, to do it dynamically, you can do something like:
df2.select(['key'] + [df2.features[x] for x in range(0,3)])
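To give the resulting columns friendlier names than features[0], features[1], ..., one variation (assuming the same hypothetical df2 with an ArrayType features column) is to alias each element:

df2.select(['key'] + [df2.features[x].alias('feature_%d' % x) for x in range(0, 3)]).show()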