I have data like below, in a file named babynames.csv:
year name percent sex
1880 John 0.081541 boy
1880 William 0.080511 boy
1880 James 0.050057 boy
I need to sort the input by year and sex, and I want the output aggregated like below (the output is to be assigned to a new RDD).
year sex avg(percentage) count(rows)
1880 boy 0.070703 3
I am not sure how to proceed after the following step in PySpark. I need your help with this:
testrdd = sc.textFile("babynames.csv")
rows = testrdd.map(lambda y: y.split(',')).filter(lambda x: "year" not in x[0])
aggregatedoutput = ????
Method 1: Using groupBy()

In PySpark, groupBy() collects rows with identical values in the grouping columns into groups on a DataFrame, so that aggregate functions can be applied to each group. sum() returns the total of a column for each group; for this question the relevant functions are avg() and count(). Multiple aggregate functions can be applied in one pass with the syntax dataframe.groupBy('column_name_group').agg(...). If you still need the other columns after a groupBy, you can join the aggregated result back to the original DataFrame on the grouping columns; the joined result will then contain all columns, including the computed values. A sketch of this is shown below.
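A minimal sketch of this approach, assuming df is a DataFrame with the question's columns (the names agg_df, data_joined, avg_percent, and n_rows are my own):

from pyspark.sql.functions import avg, count

# Aggregate percent per (year, sex) group.
agg_df = (df.groupBy("year", "sex")
            .agg(avg("percent").alias("avg_percent"),
                 count("*").alias("n_rows")))

# Optional: join back to keep the remaining columns (e.g. name).
data_joined = df.join(agg_df, on=["year", "sex"])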
Load the data with the spark-csv package (note the delimiter is "," since the file is comma-separated, matching the split(',') in the question):

df = (sqlContext.read
    .format("com.databricks.spark.csv")
    .options(inferSchema="true", delimiter=",", header="true")
    .load("babynames.csv"))
Import the required functions:
from pyspark.sql.functions import count, avg
Group by and aggregate (optionally use Column.alias):
df.groupBy("year", "sex").agg(avg("percent"), count("*"))
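Optionally, alias the aggregate columns and sort to match the requested output (the names aggregatedoutput, avg_percent, and n_rows are illustrative):

aggregatedoutput = (df.groupBy("year", "sex")
    .agg(avg("percent").alias("avg_percent"),
         count("*").alias("n_rows"))
    .orderBy("year", "sex"))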
Alternatively, you can stay at the RDD level:

- cast percent to numeric
- reshape the data to the format ((year, sex), percent)
- aggregateByKey using pyspark.statcounter.StatCounter
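A sketch of that RDD variant, building on the rows RDD from the question (column positions taken from the file layout: year, name, percent, sex; the names pairs and stats are my own):

from pyspark.statcounter import StatCounter

# Key by (year, sex); value is percent cast to float.
pairs = rows.map(lambda x: ((x[0], x[3]), float(x[2])))

# StatCounter accumulates count, mean, etc. in a single pass.
stats = pairs.aggregateByKey(
    StatCounter(),
    lambda acc, v: acc.merge(v),   # fold one value into the accumulator
    lambda a, b: a.mergeStats(b))  # combine per-partition accumulators

# Sort by (year, sex) and flatten to the requested shape.
aggregatedoutput = (stats
    .sortByKey()
    .map(lambda kv: (kv[0][0], kv[0][1], kv[1].mean(), kv[1].count())))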