I have data like the sample below in a file named babynames.csv:
year    name    percent     sex
1880    John    0.081541    boy
1880    William 0.080511    boy
1880    James   0.050057    boy
I need to group the input by year and sex and aggregate it like below (this output is to be assigned to a new RDD).
year    sex   avg(percent)   count(rows)
1880    boy   0.070703         3
I am not sure how to proceed after the following step in PySpark. I need your help with this:
testrdd = sc.textFile("babynames.csv")
# split each line on commas and drop the header row
rows = testrdd.map(lambda y: y.split(',')).filter(lambda x: "year" not in x[0])
aggregatedoutput = ????
In PySpark, groupBy() collects rows with identical values in the grouping columns into groups on a DataFrame, so that aggregate functions can be run over each group; sum(), for example, returns the total of a column for each group. Several aggregates can be applied at once via dataframe.groupBy('column_name_group').agg(...). Note that the result of a groupBy() contains only the grouping and aggregate columns; to keep the other original columns as well, join the aggregated result back to the original DataFrame.
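A minimal sketch of that pattern; the sales data and column names here are invented purely for illustration (sqlContext is the one available by default in the PySpark shell):

from pyspark.sql.functions import sum as sum_

# hypothetical toy data, just to illustrate groupBy().agg(sum(...))
sales = sqlContext.createDataFrame(
    [("east", 10), ("east", 5), ("west", 7)],
    ["region", "amount"])

# sum() returns the total of amount for each region
sales.groupBy("region").agg(sum_("amount")).show()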
spark-csv package
Load data
df = (sqlContext.read
    .format("com.databricks.spark.csv")
    .options(inferSchema="true", delimiter=",", header="true")
    .load("babynames.csv"))
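On Spark 2.0 and later, the built-in CSV reader can load the same file without the external spark-csv package (a sketch; spark here is a SparkSession):

df = (spark.read
    .options(inferSchema="true", delimiter=",", header="true")
    .csv("babynames.csv"))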
Import required functions
from pyspark.sql.functions import count, avg
Group by and aggregate (optionally use Column.alias):
df.groupBy("year", "sex").agg(avg("percent"), count("*"))
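For example, with aliases for friendlier output column names (a sketch; avg_percent and count_rows are arbitrary names, and aggregatedoutput matches the variable from the question):

aggregatedoutput = (df.groupBy("year", "sex")
    .agg(avg("percent").alias("avg_percent"),
         count("*").alias("count_rows")))
aggregatedoutput.show()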
Alternatively:
- cast percent to numeric
- reshape the data to ((year, sex), percent) pairs
- aggregateByKey using pyspark.statcounter.StatCounter
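A minimal sketch of that RDD route, building on the rows RDD from the question (the column positions assume the year, name, percent, sex layout shown above):

from pyspark.statcounter import StatCounter

# reshape to ((year, sex), percent) with percent cast to float
pairs = rows.map(lambda x: ((x[0], x[3]), float(x[2])))

# fold each group's percents into a StatCounter, then extract mean and count
aggregatedoutput = (pairs
    .aggregateByKey(StatCounter(),
                    lambda acc, v: acc.merge(v),
                    lambda a, b: a.mergeStats(b))
    .map(lambda kv: (kv[0][0], kv[0][1], kv[1].mean(), kv[1].count())))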