I have read similar questions but couldn't find a solution to my specific problem.
I have a list
l = [1, 2, 3]
and a DataFrame
df = sc.parallelize([
['p1', 'a'],
['p2', 'b'],
['p3', 'c'],
]).toDF(('product', 'name'))
I would like to obtain a new DataFrame where the list l is added as a further column, namely:
+-------+----+---------+
|product|name| new_col |
+-------+----+---------+
| p1| a| 1 |
| p2| b| 2 |
| p3| c| 3 |
+-------+----+---------+
Approaches with JOIN, where I was joining df with sc.parallelize([[1], [2], [3]]), have failed. Approaches using withColumn, as in
new_df = df.withColumn('new_col', l)
have failed because the list is not a Column object.
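As far as I understand, withColumn expects a Column expression on the right-hand side, so a constant wrapped with lit() would be accepted, but there is no Column that spreads a Python list across rows. A minimal sketch of that distinction (assuming from pyspark.sql.functions import lit):
from pyspark.sql.functions import lit
df.withColumn('new_col', lit(1))  # accepted: lit() turns a constant into a Column
# df.withColumn('new_col', l)     # rejected: a plain Python list is not a Column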
So, from reading some interesting stuff here, I've ascertained that you can't really just append a random / arbitrary column to a given DataFrame object. It appears what you want is more of a zip than a join. I looked around and found this ticket, which makes me think you won't be able to zip given that you have DataFrame rather than RDD objects.
The only way I've been able to solve your issue involves leaving the world of DataFrame objects and returning to RDD objects. I've also needed to create an index for the purpose of the join, which may or may not work with your use case.
l = sc.parallelize([1, 2, 3])
# build an index so both sides can be zipped into (index, value) pairs
index = sc.parallelize(range(0, l.count()))
z = index.zip(l)  # (index, new_col value)
rdd = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']])
rdd_index = index.zip(rdd)  # (index, [product, name])
# just in case!
assert(rdd.count() == l.count())
# perform an inner join on the index we generated above, then map it to look pretty
# (each joined record is (index, ([product, name], new_col value)))
new_rdd = rdd_index.join(z).map(lambda kv: [kv[1][0][0], kv[1][0][1], kv[1][1]])
new_df = new_rdd.toDF(['product', 'name', 'new_col'])
When I run new_df.show(), I get:
+-------+----+-------+
|product|name|new_col|
+-------+----+-------+
| p1| a| 1|
| p2| b| 2|
| p3| c| 3|
+-------+----+-------+
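For what it's worth, the same index-and-join idea can be written a bit more compactly with zipWithIndex(), which pairs each element with its position. This is just a sketch reusing the df and l from above, not a different technique:
l = sc.parallelize([1, 2, 3])
# zipWithIndex() yields (element, index); swap so the index becomes the join key
l_indexed = l.zipWithIndex().map(lambda vi: (vi[1], vi[0]))
df_indexed = df.rdd.zipWithIndex().map(lambda ri: (ri[1], ri[0]))
# inner join on the index, then flatten (index, (Row, value)) back into columns
new_df = df_indexed.join(l_indexed) \
    .map(lambda kv: [kv[1][0][0], kv[1][0][1], kv[1][1]]) \
    .toDF(['product', 'name', 'new_col'])
It avoids the separate count() and index RDD, but as above the join does not guarantee any particular output ordering.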
Sidenote: I'm really surprised this didn't work; it behaves like a cross join rather than a one-to-one pairing:
from pyspark.sql import Row
l = sc.parallelize([1, 2, 3])
new_row = Row("new_col_name")
l_as_df = l.map(new_row).toDF()
new_df = df.join(l_as_df)
When I run new_df.show(), I get:
+-------+----+------------+
|product|name|new_col_name|
+-------+----+------------+
| p1| a| 1|
| p1| a| 2|
| p1| a| 3|
| p2| b| 1|
| p3| c| 1|
| p2| b| 2|
| p2| b| 3|
| p3| c| 2|
| p3| c| 3|
+-------+----+------------+
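Since that join has no key, every left row pairs with every right row (3 x 3 = 9 rows). A sketch of how an explicit index key would restore the one-to-one pairing, reusing df and l_as_df from above ('idx' is just a name I picked):
# give both sides an index column via zipWithIndex(), then join on it
df_i = df.rdd.zipWithIndex().map(lambda ri: ri[0] + (ri[1],)).toDF(['product', 'name', 'idx'])
l_i = l_as_df.rdd.zipWithIndex().map(lambda ri: ri[0] + (ri[1],)).toDF(['new_col_name', 'idx'])
new_df = df_i.join(l_i, 'idx').drop('idx')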