
Split Spark DataFrame string column into multiple columns

I've seen various people suggesting that DataFrame.explode is a useful way to do this, but it results in more rows than the original DataFrame, which isn't what I want at all. I simply want the DataFrame equivalent of the very simple:

rdd.map(lambda row: list(row) + row.my_str_col.split('-'))

which takes something looking like:

col1 | my_str_col
-----+-----------
  18 |  856-yygrm
 201 |  777-psgdg

and converts it to this:

col1 | my_str_col | _col3 | _col4
-----+------------+-------+------
  18 |  856-yygrm |   856 | yygrm
 201 |  777-psgdg |   777 | psgdg

I am aware of pyspark.sql.functions.split(), but it results in a nested array column instead of two top-level columns like I want.

Ideally, I want these new columns to be named as well.
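
For illustration, a minimal sketch (using a hypothetical df built from the table above) of the nested-array column that split() on its own produces, which is what I'm trying to avoid:

from pyspark.sql import functions as F

df.withColumn('split', F.split(df['my_str_col'], '-')).show()
#+----+----------+------------+
#|col1|my_str_col|       split|
#+----+----------+------------+
#|  18| 856-yygrm|[856, yygrm]|
#| 201| 777-psgdg|[777, psgdg]|
#+----+----------+------------+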

asked Aug 30 '16 by Peter Gaultney


2 Answers

pyspark.sql.functions.split() is the right approach here - you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array only contains 2 items, it's very easy. You simply use Column.getItem() to retrieve each part of the array as a column itself:

split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))

The result will be:

col1 | my_str_col | NAME1 | NAME2
-----+------------+-------+------
  18 |  856-yygrm |   856 | yygrm
 201 |  777-psgdg |   777 | psgdg
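
If you prefer to name the new columns in a single pass, the same getItem() approach can be written as one select (a sketch equivalent to the withColumn calls above):

split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.select('*',
               split_col.getItem(0).alias('NAME1'),
               split_col.getItem(1).alias('NAME2'))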

I am not sure how I would solve this in a general case where the nested arrays were not the same size from Row to Row.

answered by Peter Gaultney

Here's a solution for the general case that doesn't require knowing the length of the array ahead of time, using collect(), or using UDFs. Unfortunately, this only works for Spark 2.1 and above, because it requires the posexplode function.

Suppose you had the following DataFrame:

df = spark.createDataFrame(
    [
        [1, 'A, B, C, D'],
        [2, 'E, F, G'],
        [3, 'H, I'],
        [4, 'J']
    ],
    ["num", "letters"]
)
df.show()
#+---+----------+
#|num|   letters|
#+---+----------+
#|  1|A, B, C, D|
#|  2|   E, F, G|
#|  3|      H, I|
#|  4|         J|
#+---+----------+

Split the letters column and then use posexplode to explode the resultant array along with the position in the array. Next use pyspark.sql.functions.expr to grab the element at index pos in this array.

import pyspark.sql.functions as f

df.select(
        "num",
        f.split("letters", ", ").alias("letters"),
        f.posexplode(f.split("letters", ", ")).alias("pos", "val")
    )\
    .show()
#+---+------------+---+---+
#|num|     letters|pos|val|
#+---+------------+---+---+
#|  1|[A, B, C, D]|  0|  A|
#|  1|[A, B, C, D]|  1|  B|
#|  1|[A, B, C, D]|  2|  C|
#|  1|[A, B, C, D]|  3|  D|
#|  2|   [E, F, G]|  0|  E|
#|  2|   [E, F, G]|  1|  F|
#|  2|   [E, F, G]|  2|  G|
#|  3|      [H, I]|  0|  H|
#|  3|      [H, I]|  1|  I|
#|  4|         [J]|  0|  J|
#+---+------------+---+---+

Now we create two new columns from this result. The first is the name of our new column, which will be a concatenation of letter and the index in the array. The second column will be the value at the corresponding index in the array. We get the latter by exploiting the functionality of pyspark.sql.functions.expr, which allows us to use column values as parameters.

df.select(
        "num",
        f.split("letters", ", ").alias("letters"),
        f.posexplode(f.split("letters", ", ")).alias("pos", "val")
    )\
    .drop("val")\
    .select(
        "num",
        f.concat(f.lit("letter"), f.col("pos").cast("string")).alias("name"),
        f.expr("letters[pos]").alias("val")
    )\
    .show()
#+---+-------+---+
#|num|   name|val|
#+---+-------+---+
#|  1|letter0|  A|
#|  1|letter1|  B|
#|  1|letter2|  C|
#|  1|letter3|  D|
#|  2|letter0|  E|
#|  2|letter1|  F|
#|  2|letter2|  G|
#|  3|letter0|  H|
#|  3|letter1|  I|
#|  4|letter0|  J|
#+---+-------+---+

Now we can just groupBy the num and pivot the DataFrame. Putting that all together, we get:

df.select(
        "num",
        f.split("letters", ", ").alias("letters"),
        f.posexplode(f.split("letters", ", ")).alias("pos", "val")
    )\
    .drop("val")\
    .select(
        "num",
        f.concat(f.lit("letter"), f.col("pos").cast("string")).alias("name"),
        f.expr("letters[pos]").alias("val")
    )\
    .groupBy("num").pivot("name").agg(f.first("val"))\
    .show()
#+---+-------+-------+-------+-------+
#|num|letter0|letter1|letter2|letter3|
#+---+-------+-------+-------+-------+
#|  1|      A|      B|      C|      D|
#|  3|      H|      I|   null|   null|
#|  2|      E|      F|      G|   null|
#|  4|      J|   null|   null|   null|
#+---+-------+-------+-------+-------+
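
If collecting a single aggregate (the maximum array length) is acceptable, a simpler sketch that also works on Spark versions before 2.1 is to compute that length and pull each element out with getItem(); indices past the end of a shorter array come back as null:

import pyspark.sql.functions as f

split_df = df.withColumn("letters", f.split("letters", ", "))
# this triggers a job to find the longest array, which the posexplode approach avoids
max_len = split_df.select(f.max(f.size("letters"))).first()[0]
split_df.select(
    "num",
    *[f.col("letters").getItem(i).alias("letter" + str(i)) for i in range(max_len)]
).show()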

answered by pault