PySpark: how to duplicate a row n times in a dataframe?

I've got a dataframe like this, and I want to duplicate each row n times when the value in column n is greater than one:

And transform it into this:

I think I should use explode, but I don't understand how it works...
Thanks

asked May 31 '18 12:05

Chjul

2 Answers

With Spark 2.4.0+, this is easier with the built-in functions array_repeat and explode:

from pyspark.sql.functions import expr

df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)], ["A", "B", "n"])

new_df = df.withColumn('n', expr('explode(array_repeat(n,int(n)))'))

>>> new_df.show()
+---+---+---+
|  A|  B|  n|
+---+---+---+
|  1|  2|  1|
|  2|  9|  1|
|  3|  8|  2|
|  3|  8|  2|
|  4|  1|  1|
|  5|  3|  3|
|  5|  3|  3|
|  5|  3|  3|
+---+---+---+
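
As a rough cross-check of the output above, the same transformation can be modelled in plain Python without Spark. This is only a sketch of the semantics: the array_repeat helper below is a hypothetical stand-in written for illustration, not the Spark function itself, and it assumes a non-positive count yields an empty list.

```python
# Plain-Python model (no Spark) of explode(array_repeat(n, int(n))).
rows = [(1, 2, 1), (2, 9, 1), (3, 8, 2), (4, 1, 1), (5, 3, 3)]


def array_repeat(element, count):
    """Hypothetical stand-in: a list holding `count` copies of `element`."""
    return [element] * max(count, 0)


# One output row per element of the repeated array, like explode does.
expected = [
    (a, b, value)
    for a, b, n in rows
    for value in array_repeat(n, int(n))
]

for row in expected:
    print(row)
```

The list has 1 + 1 + 2 + 1 + 3 = 8 entries, matching the eight rows shown by new_df.show().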

answered Oct 13 '22 08:10

jxc

The explode function returns a new row for each element in the given array or map.

One way to exploit this function is to use a udf to create a list of size n for each row. Then explode the resulting array.
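
Since the question was about how explode works, here is a minimal plain-Python sketch of its behaviour on an array column. It does not use Spark; explode_rows is a hypothetical helper name, and the row with an empty array is made up to show that explode produces no output row in that case (explode_outer would keep it with a null instead).

```python
def explode_rows(rows, column):
    """Yield one copy of each row per element of row[column].

    Rows whose array is empty or None yield nothing, which mirrors
    how explode drops such rows.
    """
    for row in rows:
        for element in row[column] or []:
            yield {**row, column: element}


rows = [
    {"A": 3, "B": 8, "n": [2, 2]},
    {"A": 5, "B": 3, "n": [3, 3, 3]},
    {"A": 6, "B": 0, "n": []},  # hypothetical row, e.g. what n = 0 would give
]

result = list(explode_rows(rows, "n"))
for row in result:
    print(row)
```

The first row comes out twice, the second three times, and the row with the empty array disappears.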

from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, IntegerType
    
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)] ,["A", "B", "n"]) 

+---+---+---+
|  A|  B|  n|
+---+---+---+
|  1|  2|  1|
|  2|  9|  1|
|  3|  8|  2|
|  4|  1|  1|
|  5|  3|  3|
+---+---+---+

# use a udf to turn each value n into an array holding n copies of n
n_to_array = udf(lambda n: [n] * n, ArrayType(IntegerType()))
df2 = df.withColumn('n', n_to_array(df.n))
df2.show()

+---+---+---------+
|  A|  B|        n|
+---+---+---------+
|  1|  2|      [1]|
|  2|  9|      [1]|
|  3|  8|   [2, 2]|
|  4|  1|      [1]|
|  5|  3|[3, 3, 3]|
+---+---+---------+

# now use explode
df2.withColumn('n', explode(df2.n)).show()

+---+---+---+
|  A|  B|  n|
+---+---+---+
|  1|  2|  1|
|  2|  9|  1|
|  3|  8|  2|
|  3|  8|  2|
|  4|  1|  1|
|  5|  3|  3|
|  5|  3|  3|
|  5|  3|  3|
+---+---+---+

answered Oct 13 '22 08:10

Ahmed

Tags: python, pyspark, bigdata
