Divide PySpark DataFrame Column by Column in Another PySpark DataFrame When ID Matches

Tags: python, pyspark, spark-dataframe

I have a PySpark DataFrame, df1, that looks like:

CustomerID  CustomerValue
12          .17
14          .15
14          .25
17          .50
17          .01
17          .35

I have a second PySpark DataFrame, df2, that is df1 grouped by CustomerID and aggregated by the sum function. It looks like this:

CustomerID  CustomerValueSum
12          .17
14          .40
17          .86

I want to add a third column to df1 that is df1['CustomerValue'] divided by df2['CustomerValueSum'] for the same CustomerIDs. This would look like:

CustomerID  CustomerValue  NormalizedCustomerValue
12          .17            1.00
14          .15            .38
14          .25            .62
17          .50            .58
17          .01            .01
17          .35            .41

In other words, I'm trying to convert this Python/Pandas code to PySpark:

normalized_list = []
for idx, row in df1.iterrows():
    (
        normalized_list
        .append(
            row.CustomerValue / df2[df2.CustomerID == row.CustomerID].CustomerValueSum
        )
    )
df1['NormalizedCustomerValue'] = [val.values[0] for val in normalized_list]

How can I do this?


asked Apr 07 '17 21:04

TrentWoodbury

1 Answer

Join df1 to df2 on CustomerID, divide CustomerValue by CustomerValueSum, then drop the helper column:

import pyspark.sql.functions as F

# Inner join: df1 rows with no matching CustomerID in df2 would be dropped.
df1 = (
    df1
    .join(df2, "CustomerID")
    .withColumn("NormalizedCustomerValue", F.col("CustomerValue") / F.col("CustomerValueSum"))
    .drop("CustomerValueSum")
)

Output (row order after a join is not guaranteed):

df1.show()

+----------+-------------+-----------------------+
|CustomerID|CustomerValue|NormalizedCustomerValue|
+----------+-------------+-----------------------+
|        17|          0.5|     0.5813953488372093|
|        17|         0.01|   0.011627906976744186|
|        17|         0.35|     0.4069767441860465|
|        12|         0.17|                    1.0|
|        14|         0.15|    0.37499999999999994|
|        14|         0.25|                  0.625|
+----------+-------------+-----------------------+


answered Sep 21 '22 13:09

dfernig
