
Divide Pyspark Dataframe Column by Column in other Pyspark Dataframe when ID Matches

I have a PySpark DataFrame, df1, that looks like:

CustomerID  CustomerValue
12          .17
14          .15
14          .25
17          .50
17          .01
17          .35

I have a second PySpark DataFrame, df2, that is df1 grouped by CustomerID and aggregated by the sum function. It looks like this:

 CustomerID  CustomerValueSum
 12          .17
 14          .40
 17          .86

I want to add a third column to df1 that is df1['CustomerValue'] divided by df2['CustomerValueSum'] for the same CustomerIDs. This would look like:

CustomerID  CustomerValue  NormalizedCustomerValue
12          .17            1.00
14          .15            .38
14          .25            .62
17          .50            .58
17          .01            .01
17          .35            .41

In other words, I'm trying to convert this Python/Pandas code to PySpark:

normalized_list = []
for idx, row in df1.iterrows():
    (
        normalized_list
        .append(
            row.CustomerValue / df2[df2.CustomerID == row.CustomerID].CustomerValueSum
        )
    )
df1['NormalizedCustomerValue'] = [val.values[0] for val in normalized_list]
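
The loop above amounts to looking up each row's group sum and dividing by it. In pandas the same result could be sketched without iterrows by merging on CustomerID. This is an illustrative sketch on the sample data, and the `merged` variable name is an assumption rather than part of the original code.

```python
import pandas as pd

# Sample data copied from the tables above.
df1 = pd.DataFrame({
    "CustomerID": [12, 14, 14, 17, 17, 17],
    "CustomerValue": [0.17, 0.15, 0.25, 0.50, 0.01, 0.35],
})
df2 = (
    df1.groupby("CustomerID", as_index=False)["CustomerValue"]
    .sum()
    .rename(columns={"CustomerValue": "CustomerValueSum"})
)

# A left merge keeps df1's row order and attaches the matching sum to each row.
merged = df1.merge(df2, on="CustomerID", how="left")
df1["NormalizedCustomerValue"] = (
    merged["CustomerValue"] / merged["CustomerValueSum"]
).to_numpy()
```

A join followed by a column division is the same pattern one would look for in PySpark.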

How can I do this?

Asked by TrentWoodbury, Apr 07 '17 21:04



1 Answer

Code:

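import pyspark.sql.functions as F

# The inner join on CustomerID attaches each customer's sum to every df1 row;
# the helper column is dropped once the ratio has been computed.
df1 = (
    df1
    .join(df2, "CustomerID")
    .withColumn("NormalizedCustomerValue", F.col("CustomerValue") / F.col("CustomerValueSum"))
    .drop("CustomerValueSum")
)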

Output:

df1.show()

+----------+-------------+-----------------------+
|CustomerID|CustomerValue|NormalizedCustomerValue|
+----------+-------------+-----------------------+
|        17|          0.5|     0.5813953488372093|
|        17|         0.01|   0.011627906976744186|
|        17|         0.35|     0.4069767441860465|
|        12|         0.17|                    1.0|
|        14|         0.15|    0.37499999999999994|
|        14|         0.25|                  0.625|
+----------+-------------+-----------------------+
Answered by dfernig, Sep 21 '22 13:09