Leftsemi join

This join returns only those rows from the first dataframe that have a matching row in the second dataframe; columns from the second dataframe do not appear in the result.

Syntax: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "leftsemi")

Example: In this example, we perform a leftsemi join on the ID column of both dataframes using the "leftsemi" keyword.

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"], 
        ["5", "bobby", "company 1"]]
  
# specify column names
columns = ['ID', 'NAME', 'Company']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# list of employee data
data1 = [["1", "45000", "IT"],
         ["2", "145000", "Manager"], 
         ["6", "45000", "HR"],
         ["5", "34000", "Sales"]]
  
# specify column names
columns = ['ID', 'salary', 'department']
  
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data1, columns)
  
# leftsemi join on two dataframes
dataframe.join(dataframe1,
               dataframe.ID == dataframe1.ID,
               "leftsemi").show()


Output:
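Based on the code above, the result keeps only the rows of the first dataframe whose ID also appears in the second dataframe (IDs 1, 2 and 5). The output should look roughly like this (row order may vary):

+---+------+---------+
| ID|  NAME|  Company|
+---+------+---------+
|  1|sravan|company 1|
|  2|ojaswi|company 1|
|  5| bobby|company 1|
+---+------+---------+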

PySpark Join Types – Join Two DataFrames

In this article, we are going to see how to join two dataframes in PySpark using Python. A join is used to combine two or more dataframes based on columns they share.

Syntax: dataframe1.join(dataframe2, dataframe1.column_name == dataframe2.column_name, "type")

where,

  1. dataframe1 is the first dataframe
  2. dataframe2 is the second dataframe
  3. column_name is the column that matches in both dataframes
  4. type is the type of join to perform (a minimal sketch of this pattern follows the list)
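
For instance, assuming dataframe and dataframe1 are the two dataframes created in the snippets below, an inner join on their ID columns follows this pattern (a minimal sketch, not part of the original article):

# inner join on the ID column: keeps only the rows whose ID
# appears in both dataframes, with columns from both in the result
dataframe.join(dataframe1,
               dataframe.ID == dataframe1.ID,
               "inner").show()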

Create the first dataframe for demonstration:

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"], 
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"], 
        ["5", "bobby", "company 1"]]
  
# specify column names
columns = ['ID', 'NAME', 'Company']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
dataframe.show()


Output:
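Based on the data above, dataframe.show() should print roughly:

+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  1| sravan|company 1|
|  2| ojaswi|company 1|
|  3| rohith|company 2|
|  4|sridevi|company 1|
|  5|  bobby|company 1|
+---+-------+---------+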

Create the second dataframe for demonstration:

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of employee data
data1 = [["1", "45000", "IT"],
         ["2", "145000", "Manager"],
         ["6", "45000", "HR"],
         ["5", "34000", "Sales"]]
  
# specify column names
columns = ['ID', 'salary', 'department']
  
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data1, columns)
  
dataframe1.show()


Output:
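Based on the data above, dataframe1.show() should print roughly:

+---+------+----------+
| ID|salary|department|
+---+------+----------+
|  1| 45000|        IT|
|  2|145000|   Manager|
|  6| 45000|        HR|
|  5| 34000|     Sales|
+---+------+----------+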
