Add Column Based on Another Column of DataFrame
Under this approach, the user can add a new column based on an existing column in the given dataframe.
Example 1: Using withColumn() method
In this example, the user specifies the new column name and an expression built from an existing column, passing both as parameters to the withColumn() method.
Syntax:
dataframe.withColumn("column_name", dataframe.existing_column)
where,
- dataframe is the input dataframe
- column_name is the new column
- existing_column is the column that already exists in the dataframe
In this example, we add a column named salary, computed by multiplying the ID column by 2300, using the withColumn() method.
Python3
# importing module
import pyspark

# import lit function
from pyspark.sql.functions import lit

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# add a column named salary from the ID column, multiplied by 2300
dataframe.withColumn("salary", dataframe.ID * 2300).show()
Output:
Example 2: Using concat_ws() method
In this example, the user concatenates two existing columns into a new column using the concat_ws() method, imported from the pyspark.sql.functions module.
Syntax:
dataframe.withColumn("column_name", concat_ws("Separator", "existing_column1", "existing_column2"))
where,
- dataframe is the input dataframe
- column_name is the new column name
- existing_column1 and existing_column2 are the two columns whose values are joined, with the Separator between them, to form the values of the new column
- Separator is the string placed between the values of the two columns
Example:
In this example, we add a column named Details, built from the NAME and Company columns separated by "-".
Python3
# importing module
import pyspark

# import concat_ws function
from pyspark.sql.functions import concat_ws

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# add a column named Details from the NAME and Company columns, separated by -
dataframe.withColumn("Details", concat_ws("-", "NAME", "Company")).show()
Output:
How to add a new column to a PySpark DataFrame?
In this article, we will discuss how to add a new column to a PySpark DataFrame.
Create the first data frame for demonstration:
Here, we create the sample data frame that will be used throughout the article to demonstrate each approach.
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output: