How to create a z-score in Spark SQL for each group

I have a dataframe which looks like this:

        dSc     TranAmount
 1: 100021      79.64
 2: 100021      79.64
 3: 100021       0.16
 4: 100022      11.65
 5: 100022       0.36
 6: 100022       0.47
 7: 100025       0.17
 8: 100037       0.27
 9: 100056       0.27
10: 100063       0.13
11: 100079       0.13
12: 100091       0.15
13: 100101       0.22
14: 100108       0.14
15: 100109       0.04

Now I want to create a third column with the z-score of each TranAmount, which will be

(TranAmount-mean(TranAmount))/StdDev(TranAmount)

where the mean and standard deviation are computed within each dSc group.

Now I can calculate the mean and standard deviation per group in Spark SQL:

import pyspark.sql.functions as func

(datafromdb
  .groupBy("dSc")
  .agg(datafromdb.dSc, func.avg("TranAmount"), func.stddev_pop("TranAmount")))

but I am at a loss as to how to add a third column with the z-score to the data frame. I would appreciate any pointers to the right way of achieving this.

Asked Apr 23 '16 by Bg1850

People also ask

How do you implement z-scores?

How do you calculate the z-score? The formula for calculating a z-score is z = (x-μ)/σ, where x is the raw score, μ is the population mean, and σ is the population standard deviation. As the formula shows, the z-score is simply the raw score minus the population mean, divided by the population standard deviation.
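
As a quick plain-Python illustration of that formula (using the first three TranAmount values from the question's sample data, and population statistics to match stddev_pop):

# z = (x - mu) / sigma, with population mean and population standard deviation
values = [79.64, 79.64, 0.16]
mu = sum(values) / len(values)
sigma = (sum((x - mu) ** 2 for x in values) / len(values)) ** 0.5
z_scores = [(x - mu) / sigma for x in values]
print(z_scores)  # the two identical amounts get the same z-score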

How does groupBy work in Spark?

The groupBy method is defined in the Dataset class. groupBy returns a RelationalGroupedDataset object where the agg() method is defined. Spark makes great use of object-oriented programming! The RelationalGroupedDataset class also defines a sum() method that can be used to get the same result with less code.
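
A minimal sketch of that flow in PySpark (this assumes Spark 2.x or later with a SparkSession named spark; the sample rows are taken from the question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(100021, 79.64), (100021, 0.16), (100022, 11.65)],
    ["dSc", "TranAmount"])

grouped = df.groupBy("dSc")                 # GroupedData (RelationalGroupedDataset in Scala)
grouped.agg(func.avg("TranAmount")).show()  # agg() turns it back into a DataFrame
grouped.sum("TranAmount").show()            # the shorthand sum() mentioned above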


1 Answer

You can, for example, compute the per-group statistics and join them back to the original data:

from pyspark.sql import functions as func

# per-group mean and population standard deviation
stats = (df.groupBy("dsc")
  .agg(
      func.stddev_pop("TranAmount").alias("sd"),
      func.avg("TranAmount").alias("avg")))

# broadcast the small stats table, join it back, and compute the z-score
(df
    .join(func.broadcast(stats), ["dsc"])
    .select("dsc", "TranAmount", (df.TranAmount - stats.avg) / stats.sd))

or use window functions and compute the standard deviation from its formula:

from pyspark.sql.window import Window
import sys

def z_score_w(col, w):
    # E[X] and E[X^2] over the window
    avg_ = func.avg(col).over(w)
    avg_sq = func.avg(col * col).over(w)
    # population standard deviation: sqrt(E[X^2] - E[X]^2)
    sd_ = func.sqrt(avg_sq - avg_ * avg_)
    return (col - avg_) / sd_

# a frame spanning the whole partition for each dsc
w = Window().partitionBy("dsc").rowsBetween(-sys.maxsize, sys.maxsize)
df.withColumn("zscore", z_score_w(df.TranAmount, w))
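
As an aside (an assumption on my part, not something the answer above relies on): on Spark versions where stddev_pop can be evaluated over a window (2.0+), the hand-rolled formula can be replaced with the built-in aggregate over the same partition:

# stddev_pop as a window function; with no ORDER BY the frame is the whole partition
w2 = Window.partitionBy("dsc")
df.withColumn(
    "zscore",
    (df.TranAmount - func.avg("TranAmount").over(w2)) /
    func.stddev_pop("TranAmount").over(w2))
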
Answered Oct 26 '22 by zero323