How to use the agg() function with groupBy() in PySpark

Import the sum function from the pyspark.sql.functions module so that it can be passed to the agg() method. Note that this import shadows Python's built-in sum() for the rest of the script.

Syntax: dataframe.groupBy("group_column").agg(sum("column_name"))

where,

dataframe is the PySpark DataFrame

group_column is the column to group by

column_name is the column whose values are summed within each group
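Conceptually, groupBy("group_column").agg(sum("column_name")) puts rows that share a grouping value into the same bucket and adds up the target column inside each bucket. A minimal plain-Python sketch of that idea is shown here, using the same sample rows as the PySpark example that follows. It does not use Spark at all, and the helper name grouped_sum is hypothetical, not part of any PySpark API; Spark performs the equivalent work in a distributed way.

```python
from collections import defaultdict

# Same sample rows as the article's example: [ID, NAME, DEPT, FEE]
rows = [
    ["1", "sravan", "IT", 45000],
    ["2", "ojaswi", "CS", 85000],
    ["3", "rohith", "CS", 41000],
    ["4", "sridevi", "IT", 56000],
    ["5", "bobby", "ECE", 45000],
    ["6", "gayatri", "ECE", 49000],
    ["7", "gnanesh", "CS", 45000],
    ["8", "bhanu", "Mech", 21000],
]


def grouped_sum(records, key_index, value_index):
    """Hypothetical helper: total the value column per distinct key."""
    totals = defaultdict(int)
    for record in records:
        # Each record lands in the bucket named by its key column
        totals[record[key_index]] += record[value_index]
    return dict(totals)


# Group by DEPT (index 2) and sum FEE (index 3)
fee_by_dept = grouped_sum(rows, key_index=2, value_index=3)
print(fee_by_dept)
```

This is only a way to cross-check the per-department totals by hand; for real workloads the PySpark call is the one to use.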

Python3

# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# import sum
from pyspark.sql.functions import sum
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]
        ]
 
# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# group by DEPT and sum FEE within each department
dataframe.groupBy("DEPT").agg(sum("FEE")).show()

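Output (row order may vary between runs):

+----+--------+
|DEPT|sum(FEE)|
+----+--------+
|  IT|  101000|
|  CS|  171000|
| ECE|   94000|
|Mech|   21000|
+----+--------+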

PySpark DataFrame: Summing a column while grouping by another

In this article, we will discuss how to sum one column while grouping by another in a PySpark DataFrame using Python.

Let’s create the dataframe for demonstration:

Python3

# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]
        ]
 
# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# display
dataframe.show()

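Output:

+---+-------+----+-----+
| ID|   NAME|DEPT|  FEE|
+---+-------+----+-----+
|  1| sravan|  IT|45000|
|  2| ojaswi|  CS|85000|
|  3| rohith|  CS|41000|
|  4|sridevi|  IT|56000|
|  5|  bobby| ECE|45000|
|  6|gayatri| ECE|49000|
|  7|gnanesh|  CS|45000|
|  8|  bhanu|Mech|21000|
+---+-------+----+-----+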
