PySpark - Aggregation on multiple columns

I have data like below. Filename:babynames.csv.

year    name    percent     sex
1880    John    0.081541    boy
1880    William 0.080511    boy
1880    James   0.050057    boy

I need to group the input by year and sex, and I want the output aggregated as below (the result should be assigned to a new RDD).

year    sex   avg(percentage)   count(rows)
1880    boy   0.070703         3

I am not sure how to proceed after the following step in PySpark; I need your help with this:

testrdd = sc.textFile("babynames.csv")
rows = testrdd.map(lambda y: y.split(',')).filter(lambda x: "year" not in x[0])  # drop the header row
aggregatedoutput = ????
asked Mar 27 '16 by Mohan

People also ask

How do you aggregate columns in PySpark?

Method 1: Using the groupBy() method. In PySpark, groupBy() is used to collect identical data into groups on a DataFrame and perform aggregate functions on the grouped data. One such aggregate function is sum(), which returns the total of a column's values for each group.
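
For example, assuming a DataFrame df with the question's columns (a minimal sketch, not part of the snippet above):

    df.groupBy("year", "sex").sum("percent")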

How do you add two aggregate functions in PySpark?

We can apply multiple aggregate functions in a single call using the following syntax: dataframe.groupBy('column_name_group').agg(function1, function2, ...).
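
As a hedged illustration with the same assumed df, two aggregate functions in one agg() call:

    from pyspark.sql.functions import avg, count

    df.groupBy("year", "sex").agg(avg("percent"), count("name"))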

How do you get all the columns after groupBy in PySpark?

Suppose you have a DataFrame df that includes the columns "name" and "age", and you want to perform a groupBy on these two columns. To get the other columns back after the groupBy, join the grouped result to the original DataFrame; the joined result (data_joined below) will then have all columns, including the count values.
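
A minimal sketch of that pattern, assuming a DataFrame df with "name" and "age" columns:

    counts = df.groupBy("name", "age").count()         # grouped result with a `count` column
    data_joined = df.join(counts, on=["name", "age"])  # all original columns plus `count`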


1 Answer

  1. Follow the instructions from the spark-csv README to include the spark-csv package.
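
     For example, launch pyspark with the package on the command line (hedged example coordinates; match them to your Spark and Scala versions):

    pyspark --packages com.databricks:spark-csv_2.10:1.4.0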
  2. Load data

    df = (sqlContext.read
        .format("com.databricks.spark.csv")
        .options(inferSchema="true", delimiter=",", header="true")  # babynames.csv is comma-separated
        .load("babynames.csv"))
    
  3. Import required functions

    from pyspark.sql.functions import count, avg
    
  4. Group by and aggregate (optionally use Column.alias):

    df.groupBy("year", "sex").agg(avg("percent"), count("*"))
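
     For example, with aliases that match the desired output headers (a hedged variant, not part of the original answer):

    df.groupBy("year", "sex").agg(
        avg("percent").alias("avg(percentage)"),
        count("*").alias("count(rows)"))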
    

Alternatively, working directly with the RDD (a sketch follows this list):

  • cast percent to numeric
  • reshape to a format ((year, sex), percent)
  • aggregateByKey using pyspark.statcounter.StatCounter
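
A minimal sketch of that approach, assuming the rows RDD from the question (fields: year, name, percent, sex):

    from pyspark.statcounter import StatCounter

    # ((year, sex), percent) pairs, with percent cast to float
    pairs = rows.map(lambda x: ((x[0], x[3]), float(x[2])))

    # a StatCounter accumulates count, mean, and more in a single pass
    stats = pairs.aggregateByKey(
        StatCounter(),
        lambda acc, v: acc.merge(v),   # fold one value into a partition's accumulator
        lambda a, b: a.mergeStats(b))  # combine accumulators across partitions

    aggregatedoutput = stats.map(
        lambda kv: (kv[0][0], kv[0][1], kv[1].mean(), kv[1].count()))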
answered Oct 01 '22 by 3 revs