
Multiplying a column in a Spark dataframe by a constant value

I am having an issue creating a new column in my Spark dataframe. I'm attempting to create a new column using withColumn() as follows:

.withColumn('%_diff_from_avg', 
     ((col('aggregate_sales') - col('avg_sales')) / col('avg_sales') * 100))

Some values are calculated correctly, but most of the values in the resulting table are null. I don't understand why.

Interestingly, when I drop the '* 100' from the calculation, all my values are populated correctly - i.e. no nulls. For example:

.withColumn('%_diff_from_avg', 
    ((col('aggregate_sales') - col('avg_sales')) / col('avg_sales')))

seems to work.

So it seems that the multiplication by 100 is causing the issue.

Can anyone explain why?

W05aDePQw6h8e7 asked Oct 06 '17 15:10



1 Answer

This happened to me too. It could be an issue with the data types of your columns. Try multiplying by the float 100.0 instead of the integer 100:

.withColumn('%_diff_from_avg', 
     ((col('aggregate_sales') - col('avg_sales')) / col('avg_sales') * 100.0))

It worked for me.
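The thread does not say what types the two columns have, so the following is only one possible explanation. If aggregate_sales and avg_sales are high-precision decimal columns, multiplying by the integer 100 keeps the result a decimal, and by default Spark returns null when a decimal result overflows its precision. Multiplying by the Python float 100.0 makes the result a double instead, which would avoid that. Under that assumption, casting both columns to double before doing any arithmetic makes the intent explicit. Here is a minimal sketch that builds the calculation as a Spark SQL expression string; the helper name pct_diff_expr is hypothetical and not part of any library:

```python
def pct_diff_expr(value_col: str, avg_col: str) -> str:
    """Build a Spark SQL expression for the percentage difference from an average.

    Both columns are cast to DOUBLE first, so the subtraction, division and
    multiplication all happen in double precision rather than under Spark's
    decimal precision rules (assumption: the source columns are decimals).
    """
    # Backticks quote the column names for Spark SQL.
    value = f"CAST(`{value_col}` AS DOUBLE)"
    avg = f"CAST(`{avg_col}` AS DOUBLE)"
    return f"({value} - {avg}) / {avg} * 100"


# The expression string for the columns used in the question.
expression = pct_diff_expr("aggregate_sales", "avg_sales")
print(expression)
```

With pyspark available, the string can be passed to expr() from pyspark.sql.functions, for example .withColumn('%_diff_from_avg', expr(pct_diff_expr('aggregate_sales', 'avg_sales'))). Checking df.printSchema() first would show whether the columns really are decimals.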

Gabriel Andriotti answered Sep 20 '22 06:09