Add Column When not Exists on DataFrame

In this method, a column is added only if it does not already exist: an if condition checks dataframe.columns for the column name, and withColumn() with the lit() function adds the column when it is missing.

Syntax:

if 'column_name' not in dataframe.columns:
    dataframe = dataframe.withColumn("column_name", lit(value))

where,

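  • dataframe.columns returns the list of column names
  • lit(value) creates a column that holds the constant value
  • withColumn() returns a new DataFrame with the column added; the original dataframe is not modified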

Example:

In this example, we add a salary column with the constant value 34000, using an if condition together with the withColumn() and lit() functions.

Python3




# importing module
import pyspark
 
# import lit function
from pyspark.sql.functions import lit
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
 
# specify column names
columns = ['ID', 'NAME', 'Company']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# add salary column only if it does not already exist
if 'salary' not in dataframe.columns:
    dataframe = dataframe.withColumn("salary", lit(34000))
 
# display the dataframe
dataframe.show()


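Output:

+---+-------+---------+------+
| ID|   NAME|  Company|salary|
+---+-------+---------+------+
|  1| sravan|company 1| 34000|
|  2| ojaswi|company 1| 34000|
|  3| rohith|company 2| 34000|
|  4|sridevi|company 1| 34000|
|  5|  bobby|company 1| 34000|
+---+-------+---------+------+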
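One caveat with the membership test used in this example: 'salary' not in dataframe.columns is an exact, case-sensitive string comparison, whereas Spark resolves column names case-insensitively by default (the spark.sql.caseSensitive setting is false unless changed). A column named 'Salary' would therefore be reported as missing by a check for 'salary'. The following is a minimal sketch of a case-insensitive check; has_column is a hypothetical helper, not part of PySpark, and it operates on the plain list returned by dataframe.columns.

```python
def has_column(columns, name, case_sensitive=False):
    """Hypothetical helper: check whether `name` is in a list of column names."""
    if case_sensitive:
        return name in columns
    # compare lowercase forms so 'Company' and 'company' are treated alike
    wanted = name.lower()
    return any(col.lower() == wanted for col in columns)


# same column names as dataframe.columns in the example
columns = ['ID', 'NAME', 'Company']

print('company' in columns)            # False: plain membership is case-sensitive
print(has_column(columns, 'company'))  # True: names are compared in lowercase
print(has_column(columns, 'salary'))   # False: the column is really absent
```

With such a helper, the condition in the example would read if not has_column(dataframe.columns, 'salary'): before the call to withColumn().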

How to add a new column to a PySpark DataFrame?

In this article, we will discuss how to add a new column to a PySpark DataFrame.

Create the first DataFrame for demonstration:

Here, we create the sample DataFrame that is used in the rest of the article to demonstrate each approach.

Python3




# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
 
# specify column names
columns = ['ID', 'NAME', 'Company']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
dataframe.show()


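Output:

+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  1| sravan|company 1|
|  2| ojaswi|company 1|
|  3| rohith|company 2|
|  4|sridevi|company 1|
|  5|  bobby|company 1|
+---+-------+---------+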

Method 1: Add New Column With Constant Value

In this approach, to add a new column with a constant value, the user calls the lit() function inside the withColumn() function and passes the required parameters to both functions. Here, lit() is available in the pyspark.sql.functions module...

Method 2: Add Column Based on Another Column of DataFrame

Under this approach, the user can add a new column based on an existing column in the given dataframe...

Method 3: Add Column When not Exists on DataFrame

...

Method 4: Add Column to DataFrame using select()

...

Method 5: Add Column to DataFrame using SQL Expression

...

Method 6: Add Column Value Based on Condition

...