Add Column When not Exists on DataFrame
In this method, the user can add a column when it is not existed by adding a column with the lit() function and checking using if the condition.
Syntax:
if 'column_name' not in dataframe.columns: dataframe.withColumn("column_name",lit(value))
where,
- dataframe. columns are used to get the column names
Example:
In this example, we add a column of the salary to 34000 using the if condition with the withColumn() and the lit() function.
Python3
# importing module import pyspark # import concat_ws and lit function from pyspark.sql.functions import concat_ws, lit # importing sparksession from pyspark.sql module from pyspark.sql import SparkSession # creating sparksession and giving an app name spark = SparkSession.builder.appName( 'sparkdf' ).getOrCreate() # list of employee data data = [[ "1" , "sravan" , "company 1" ], [ "2" , "ojaswi" , "company 1" ], [ "3" , "rohith" , "company 2" ], [ "4" , "sridevi" , "company 1" ], [ "5" , "bobby" , "company 1" ]] # specify column names columns = [ 'ID' , 'NAME' , 'Company' ] # creating a dataframe from the lists of data dataframe = spark.createDataFrame(data, columns) # add salary column by checking its existence if 'salary' not in dataframe.columns: dataframe.withColumn( "salary" , lit( 34000 )).show() |
Output:
How to add a new column to a PySpark DataFrame ?
In this article, we will discuss how to add a new column to PySpark Dataframe.
Create the first data frame for demonstration:
Here, we will be creating the sample data frame which we will be used further to demonstrate the approach purpose.
Python3
# importing module import pyspark # importing sparksession from pyspark.sql module from pyspark.sql import SparkSession # creating sparksession and giving an app name spark = SparkSession.builder.appName( 'sparkdf' ).getOrCreate() # list of employee data data = [[ "1" , "sravan" , "company 1" ], [ "2" , "ojaswi" , "company 1" ], [ "3" , "rohith" , "company 2" ], [ "4" , "sridevi" , "company 1" ], [ "5" , "bobby" , "company 1" ]] # specify column names columns = [ 'ID' , 'NAME' , 'Company' ] # creating a dataframe from the lists of data dataframe = spark.createDataFrame(data, columns) dataframe.show() |
Output: