Spark withColumn() performing power functions

I have a DataFrame df with columns "col1" and "col2". I want to create a third column that uses one of the columns in an exponentiation.

df = df.withColumn("col3", 100**(df["col1"]))*df["col2"]

However, this always results in:

TypeError: unsupported operand type(s) for ** or pow(): 'float' and 'Column'

I understand that this is because the function receives df["col1"] as a Column object rather than the value at that row.

If I perform

results = df.map(lambda x: 100**x["col1"] * x["col2"])

this works, but I can't append to my original data frame.

Any thoughts?

This is my first time posting, so I apologize for any formatting problems.

asked Oct 22 '15 by zdcheng

People also ask

What does withColumn do in spark?

Returns a new DataFrame by adding a column or replacing the existing column that has the same name. The column expression must be an expression over this DataFrame; attempting to add a column from some other DataFrame will raise an error. New in version 1.3.

What are the two arguments for the withColumn() function?

The withColumn() function takes two arguments: the first is the name of the new column, and the second is the value of the column as a Column expression.

Does withColumn replace existing column?

withColumn creates a new column with the given name. If a column with that name already exists, the new column replaces it and the old one is dropped.
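
For instance, a minimal sketch of both behaviors (assuming a SparkSession named spark is available; the data here is illustrative):

from pyspark.sql.functions import col

df = spark.createDataFrame([(1, 2), (2, 3)], ["col1", "col2"])

# "col3" does not exist yet, so it is added
df.withColumn("col3", col("col1") + col("col2")).show()

# "col2" already exists, so it is replaced
df.withColumn("col2", col("col2") * 10).show()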

What does show () do in PySpark?

Spark show() – Display DataFrame Contents in Table. Spark DataFrame show() displays the contents of the DataFrame in a table (row and column) format. By default, it shows only 20 rows, and column values are truncated at 20 characters.
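
For example, a small usage sketch (n and truncate are show()'s standard parameters; df is assumed to exist):

# Default: up to 20 rows, values truncated at 20 characters
df.show()

# Show 5 rows without truncating column values
df.show(n=5, truncate=False)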


1 Answer

Since Spark 1.4 you can use the pow function as follows:

from pyspark.sql import Row
from pyspark.sql.functions import pow, col

row = Row("col1", "col2")
df = sc.parallelize([row(1, 2), row(2, 3), row(3, 3)]).toDF()

df.select("*", pow(col("col1"), col("col2")).alias("pow")).show()

## +----+----+----+
## |col1|col2| pow|
## +----+----+----+
## |   1|   2| 1.0|
## |   2|   3| 8.0|
## |   3|   3|27.0|
## +----+----+----+
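
Since the question asks about withColumn specifically, the same expression can be attached as a new column directly. As a sketch, this also reconstructs the asker's intended 100**col1 * col2 computation, using lit to wrap the literal base:

from pyspark.sql.functions import lit

# Equivalent to the select above, but appended via withColumn
df.withColumn("pow", pow(col("col1"), col("col2"))).show()

# The original goal: 100 raised to col1, times col2
df.withColumn("col3", pow(lit(100.0), col("col1")) * col("col2")).show()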

If you use an older version, a Python UDF should do the trick:

import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

my_pow = udf(lambda x, y: math.pow(x, y), DoubleType())
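
The UDF can then be applied the same way; a usage sketch, reusing df and the col import from the snippet above:

df.select("*", my_pow(col("col1"), col("col2")).alias("pow")).show()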
answered Oct 10 '22 by zero323