Add Column to DataFrame using SQL Expression
In this method, the user has to use SQL expression with SQL function to add a column. Before that, we have to create a temporary view, From that view, we have to add and select columns.
Syntax:
dataframe.createOrReplaceTempView("name") spark.sql("select 'value' as column_name from view")
where,
- dataframe is the input dataframe
- name is the temporary view name
- sql function will take SQL expression as input to add a column
- column_name is the new column name
- value is the column value
Example:
Add new column named salary with 34000 value
Python3
# importing module import pyspark # import concat_ws and lit function from pyspark.sql.functions import concat_ws, lit # importing sparksession from pyspark.sql module from pyspark.sql import SparkSession # creating sparksession and giving an app name spark = SparkSession.builder.appName( 'sparkdf' ).getOrCreate() # list of employee data data = [[ "1" , "sravan" , "company 1" ], [ "2" , "ojaswi" , "company 1" ], [ "3" , "rohith" , "company 2" ], [ "4" , "sridevi" , "company 1" ], [ "5" , "bobby" , "company 1" ]] # specify column names columns = [ 'ID' , 'NAME' , 'Company' ] # creating a dataframe from the lists of data dataframe = spark.createDataFrame(data, columns) # create view dataframe.createOrReplaceTempView( "view" ) # add new column named salary with 34000 value spark.sql( "select '34000' as salary from view" ).show() |
Output:
How to add a new column to a PySpark DataFrame ?
In this article, we will discuss how to add a new column to PySpark Dataframe.
Create the first data frame for demonstration:
Here, we will be creating the sample data frame which we will be used further to demonstrate the approach purpose.
Python3
# importing module import pyspark # importing sparksession from pyspark.sql module from pyspark.sql import SparkSession # creating sparksession and giving an app name spark = SparkSession.builder.appName( 'sparkdf' ).getOrCreate() # list of employee data data = [[ "1" , "sravan" , "company 1" ], [ "2" , "ojaswi" , "company 1" ], [ "3" , "rohith" , "company 2" ], [ "4" , "sridevi" , "company 1" ], [ "5" , "bobby" , "company 1" ]] # specify column names columns = [ 'ID' , 'NAME' , 'Company' ] # creating a dataframe from the lists of data dataframe = spark.createDataFrame(data, columns) dataframe.show() |
Output: