Add Column Based on Another Column of DataFrame

Under this approach, the user can add a new column based on an existing column in the given dataframe.

Example 1: Using withColumn() method

In this example, the new column's name and an expression built from an existing column are passed to the withColumn() function.

Syntax:

dataframe.withColumn("column_name", dataframe.existing_column)

where,

  • dataframe is the input dataframe
  • column_name is the new column
  • existing_column is a column that already exists in the dataframe

The following example adds a column named salary by multiplying the ID column by 2300 using the withColumn() method:

Python3




# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
 
# specify column names
columns = ['ID', 'NAME', 'Company']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# add a column named salary by multiplying the ID column by 2300
# (ID holds strings, so this relies on Spark casting the values implicitly)
dataframe.withColumn("salary", dataframe.ID*2300).show()


Output:

Example 2: Using concat_ws()

In this example, two existing columns are concatenated into a new column using the concat_ws() function, which is imported from the pyspark.sql.functions module.

Syntax:

dataframe.withColumn("column_name", concat_ws("Separator", "existing_column1", "existing_column2"))

where,

  • dataframe is the input dataframe
  • column_name is the new column name
  • existing_column1 and existing_column2 are the two columns whose values are joined to form the new column
  • Separator is the string placed between the values of the two columns

Example:

In this example, we add a column named Details built from the NAME and Company columns, separated by "-".

Python3




# importing module
import pyspark
 
# import concat_ws function
from pyspark.sql.functions import concat_ws
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
 
# specify column names
columns = ['ID', 'NAME', 'Company']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# add a column named Details from the NAME and Company columns separated by -
dataframe.withColumn("Details", concat_ws("-", "NAME", 'Company')).show()


Output:

How to add a new column to a PySpark DataFrame?

In this article, we will discuss how to add a new column to a PySpark DataFrame.

Create the first data frame for demonstration:

Here, we create the sample data frame that is used in the examples to demonstrate each approach.

Python3




# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
 
# specify column names
columns = ['ID', 'NAME', 'Company']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
dataframe.show()


Output:
