Spark withColumn() performing power functions

I have a DataFrame df with columns "col1" and "col2". I want to create a third column that uses one of the columns in an exponentiation.

df = df.withColumn("col3", 100**(df["col1"]))*df["col2"]

However, this always results in:

TypeError: unsupported operand type(s) for ** or pow(): 'float' and 'Column'

I understand that this is because the function receives df["col1"] as a Column object rather than the value at that row.

If I perform

results = df.map(lambda x: 100**x["col1"] * x["col2"])

this works, but I can't append to my original data frame.

Any thoughts?

This is my first time posting, so I apologize for any formatting problems.

asked Oct 22 '15 by zdcheng

People also ask

What does withColumn do in spark?

Returns a new DataFrame by adding a column or replacing the existing column that has the same name. The column expression must be an expression over this DataFrame; attempting to add a column from some other DataFrame will raise an error. New in version 1.3.

What are the two arguments for the withColumn() function?

The withColumn() function takes two arguments: the first is the name of the new column, and the second is the value of the column as a Column expression.

Does withColumn replace existing column?

withColumn creates a new column with the given name. If a column with that name already exists, the new column replaces it and the old one is dropped.
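
For instance, a minimal sketch of both behaviors (assuming a SparkSession named spark is available; the data here is illustrative):

from pyspark.sql.functions import col

df = spark.createDataFrame([(1, 2), (2, 3)], ["col1", "col2"])

# "col3" does not exist yet, so it is added
df.withColumn("col3", col("col1") + col("col2")).show()

# "col2" already exists, so it is replaced
df.withColumn("col2", col("col2") * 10).show()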

What does show () do in PySpark?

Spark show() – Display DataFrame Contents in Table. Spark DataFrame show() displays the contents of the DataFrame in a table (row and column) format. By default, it shows only 20 rows, and column values are truncated at 20 characters.
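
For example, a small usage sketch (n and truncate are show()'s standard parameters; df is assumed to exist):

# Default: up to 20 rows, values truncated at 20 characters
df.show()

# Show 5 rows without truncating column values
df.show(n=5, truncate=False)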


1 Answer

Since Spark 1.4 you can use the pow function as follows:

from pyspark.sql import Row
from pyspark.sql.functions import pow, col

row = Row("col1", "col2")
df = sc.parallelize([row(1, 2), row(2, 3), row(3, 3)]).toDF()

df.select("*", pow(col("col1"), col("col2")).alias("pow")).show()

## +----+----+----+
## |col1|col2| pow|
## +----+----+----+
## |   1|   2| 1.0|
## |   2|   3| 8.0|
## |   3|   3|27.0|
## +----+----+----+
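
Since the question asks about withColumn specifically, the same expression can be attached as a new column directly. As a sketch, this also reconstructs the asker's intended 100**col1 * col2 computation, using lit to wrap the literal base:

from pyspark.sql.functions import lit

# Equivalent to the select above, but appended via withColumn
df.withColumn("pow", pow(col("col1"), col("col2"))).show()

# The original goal: 100 raised to col1, times col2
df.withColumn("col3", pow(lit(100.0), col("col1")) * col("col2")).show()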

If you use an older version, a Python UDF should do the trick:

import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

my_pow = udf(lambda x, y: math.pow(x, y), DoubleType())
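
The UDF can then be applied the same way; a usage sketch, reusing df and the col import from the snippet above:

df.select("*", my_pow(col("col1"), col("col2")).alias("pow")).show()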
answered Oct 10 '22 by zero323