PySpark - Aggregation on multiple columns

I have data like below. Filename:babynames.csv.

year    name    percent     sex
1880    John    0.081541    boy
1880    William 0.080511    boy
1880    James   0.050057    boy

I need to group the input by year and sex, and I want the output aggregated as below (the result should be assigned to a new RDD).

year    sex   avg(percentage)   count(rows)
1880    boy   0.070703         3

I am not sure how to proceed after the following step in PySpark; I need your help with this:

testrdd = sc.textFile("babynames.csv")
rows = testrdd.map(lambda y: y.split(',')).filter(lambda x: "year" not in x[0])  # drop the header row
aggregatedoutput = ????
asked Mar 27 '16 by Mohan

People also ask

How do you aggregate columns in PySpark?

Method 1: Using the groupBy() method. In PySpark, groupBy() is used to collect identical data into groups on a DataFrame and perform aggregate functions on the grouped data. One such aggregate function is sum(), which returns the total of a column's values for each group.
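
For example, assuming a DataFrame df with the question's columns (a minimal sketch, not part of the snippet above):

    df.groupBy("year", "sex").sum("percent")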

How do you add two aggregate functions in PySpark?

We can apply multiple aggregate functions in a single call using the following syntax: dataframe.groupBy('column_name_group').agg(function1, function2, ...).
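
As a hedged illustration with the same assumed df, two aggregate functions in one agg() call:

    from pyspark.sql.functions import avg, count

    df.groupBy("year", "sex").agg(avg("percent"), count("name"))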

How do you get all the columns after groupBy in PySpark?

Suppose you have a DataFrame df that includes the columns "name" and "age", and you want to perform a groupBy on these two columns. To get the other columns back after the groupBy, join the grouped result to the original DataFrame; the joined result (data_joined below) will then have all columns, including the count values.
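
A minimal sketch of that pattern, assuming a DataFrame df with "name" and "age" columns:

    counts = df.groupBy("name", "age").count()         # grouped result with a `count` column
    data_joined = df.join(counts, on=["name", "age"])  # all original columns plus `count`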


1 Answer

  1. Follow the instructions from the spark-csv README to include the spark-csv package.
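
     For example, launch pyspark with the package on the command line (hedged example coordinates; match them to your Spark and Scala versions):

    pyspark --packages com.databricks:spark-csv_2.10:1.4.0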
  2. Load data

    df = (sqlContext.read
        .format("com.databricks.spark.csv")
        .options(inferSchema="true", delimiter=",", header="true")  # babynames.csv is comma-separated
        .load("babynames.csv"))
    
  3. Import required functions

    from pyspark.sql.functions import count, avg
    
  4. Group by and aggregate (optionally use Column.alias):

    df.groupBy("year", "sex").agg(avg("percent"), count("*"))
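
     For example, with aliases that match the desired output headers (a hedged variant, not part of the original answer):

    df.groupBy("year", "sex").agg(
        avg("percent").alias("avg(percentage)"),
        count("*").alias("count(rows)"))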
    

Alternatively, working directly with the RDD (a sketch follows this list):

  • cast percent to numeric
  • reshape to a format ((year, sex), percent)
  • aggregateByKey using pyspark.statcounter.StatCounter
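
A minimal sketch of that approach, assuming the rows RDD from the question (fields: year, name, percent, sex):

    from pyspark.statcounter import StatCounter

    # ((year, sex), percent) pairs, with percent cast to float
    pairs = rows.map(lambda x: ((x[0], x[3]), float(x[2])))

    # a StatCounter accumulates count, mean, and more in a single pass
    stats = pairs.aggregateByKey(
        StatCounter(),
        lambda acc, v: acc.merge(v),   # fold one value into a partition's accumulator
        lambda a, b: a.mergeStats(b))  # combine accumulators across partitions

    aggregatedoutput = stats.map(
        lambda kv: (kv[0][0], kv[0][1], kv[1].mean(), kv[1].count()))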
answered Oct 01 '22 by 3 revs