How to use agg() function with GroupBy() in Python

Here we have to import the sum function from the pyspark.sql.functions module to use it with the agg() method.

Syntax: dataframe.groupBy("group_column").agg(sum("column_name"))

where,

  • dataframe is the PySpark dataframe
  • group_column is the column to group by
  • column_name is the column to be summed

Python3

# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# import sum (note: this shadows Python's built-in sum)
from pyspark.sql.functions import sum
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]
        ]
 
# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# group by DEPT and compute the sum of FEE
dataframe.groupBy("DEPT").agg(sum("FEE")).show()


Output (row order may vary):

+----+--------+
|DEPT|sum(FEE)|
+----+--------+
|  CS|  171000|
|  IT|  101000|
| ECE|   94000|
|Mech|   21000|
+----+--------+
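As a variation on the example above (not part of the original code), agg() also accepts several aggregate functions at once, and alias() can rename the result columns; the names total_fee and avg_fee below are illustrative.

Python3

# a variation: several aggregates in one agg() call, with alias()
# renaming the result columns (reuses the dataframe created above)
from pyspark.sql.functions import sum, avg

dataframe.groupBy("DEPT").agg(
    sum("FEE").alias("total_fee"),
    avg("FEE").alias("avg_fee")
).show()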

PySpark dataframe: Summing a column while grouping over another

In this article, we will discuss how to sum one column while grouping by another in a PySpark dataframe using Python.

Let’s create the dataframe for demonstration:

Python3

# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]
        ]
 
# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# display
dataframe.show()


Output:

+---+-------+----+-----+
| ID|   NAME|DEPT|  FEE|
+---+-------+----+-----+
|  1| sravan|  IT|45000|
|  2| ojaswi|  CS|85000|
|  3| rohith|  CS|41000|
|  4|sridevi|  IT|56000|
|  5|  bobby| ECE|45000|
|  6|gayatri| ECE|49000|
|  7|gnanesh|  CS|45000|
|  8|  bhanu|Mech|21000|
+---+-------+----+-----+

Method 1: Using groupBy() Method

...
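The details of this method are elided above; as a minimal sketch, groupBy() exposes sum() directly on the grouped data, so no agg() call is needed (this reuses the dataframe created earlier):

Python3

# Method 1 sketch: GroupedData exposes sum() directly, without agg();
# the result column is named sum(FEE) automatically
dataframe.groupBy("DEPT").sum("FEE").show()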

Method 2: Using agg() function with GroupBy()

In PySpark, groupBy() is used to collect identical data into groups on the PySpark DataFrame and perform aggregate functions on the grouped data. Here the aggregate function is sum()....
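The full code for this method appears at the top of this article; as a short additional sketch, agg() also accepts a dict mapping a column name to an aggregate-function name:

Python3

# Method 2, dict form: equivalent to agg(sum("FEE"));
# reuses the dataframe created earlier
dataframe.groupBy("DEPT").agg({"FEE": "sum"}).show()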

Method 3: Using Window function with sum

...
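The body of this method is also elided; as a hedged sketch, a window partitioned by the grouping column attaches the per-group sum to every row instead of collapsing the groups (the column name total_fee is illustrative):

Python3

# Method 3 sketch: sum FEE over a window partitioned by DEPT,
# keeping every input row (no rows are collapsed)
from pyspark.sql.functions import sum
from pyspark.sql.window import Window

dataframe.withColumn(
    "total_fee", sum("FEE").over(Window.partitionBy("DEPT"))
).show()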