Leftsemi join
This join returns only the rows of the first dataframe that have a matching row in the second dataframe, and it keeps only the columns of the first dataframe.
Syntax: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "leftsemi")
Example: In this example, we perform a leftsemi join using the "leftsemi" keyword, matching on the ID column that both dataframes share.
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# list of employee data
data1 = [["1", "45000", "IT"],
         ["2", "145000", "Manager"],
         ["6", "45000", "HR"],
         ["5", "34000", "Sales"]]

# specify column names
columns = ['ID', 'salary', 'department']

# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data1, columns)

# leftsemi join on the two dataframes based on the ID column
dataframe.join(dataframe1,
               dataframe.ID == dataframe1.ID,
               "leftsemi").show()
Output:
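The same call can also be written by passing the shared column name and the join type as named arguments, which avoids spelling out the equality condition. A minimal sketch, assuming the dataframe and dataframe1 built in the example above are still in scope:
Python3
# a minimal sketch: the same leftsemi join, written with the shared column name
# (assumes dataframe and dataframe1 from the example above are in scope)
dataframe.join(dataframe1, on="ID", how="leftsemi").show()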
PySpark Join Types – Join Two DataFrames
In this article, we are going to see how to join two dataframes in PySpark using Python. A join is used to combine two or more dataframes based on columns they share.
Syntax: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "type")
where,
- dataframe1 is the first dataframe
- dataframe2 is the second dataframe
- column_name is the column on which the two dataframes are matched
- type is the type of join to perform (a small sketch of this pattern follows the list)
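As a rough illustration of this pattern before the article's demonstration dataframes are created below, the sketch uses two tiny made-up dataframes; df1, df2, their data, and the ID column are assumptions for illustration only:
Python3
from pyspark.sql import SparkSession

# start a sparksession for the sketch
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# two tiny illustrative dataframes that share an ID column
df1 = spark.createDataFrame([["1", "a"], ["2", "b"]], ['ID', 'value1'])
df2 = spark.createDataFrame([["1", "x"], ["3", "y"]], ['ID', 'value2'])

# condition on the shared column; the last argument selects the join type
df1.join(df2, df1.ID == df2.ID, "inner").show()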
Create the first dataframe for demonstration:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

dataframe.show()
Output:
Create the second dataframe for demonstration:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data1 = [["1", "45000", "IT"],
         ["2", "145000", "Manager"],
         ["6", "45000", "HR"],
         ["5", "34000", "Sales"]]

# specify column names
columns = ['ID', 'salary', 'department']

# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data1, columns)

dataframe1.show()
Output: