How to use unionByName() in Python
In PySpark you can easily concatenate two DataFrames with unionByName(), which matches columns by name rather than by position (the method is available since Spark 2.3; its allowMissingColumns option was added in Spark 3.1).
Syntax: dataframe_1.unionByName(dataframe_2)
where,
- dataframe_1 is the first dataframe
- dataframe_2 is the second dataframe
Example:
Python3
# Union the two dataframes by using unionByName()
result1 = df1.unionByName(df2)

# Display the result
result1.show()
Output:
+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+
Concatenate two PySpark dataframes
In this article, we are going to see how to concatenate two PySpark DataFrames using Python.
Creating Dataframe for demonstration:
Python3
# Importing necessary libraries
from pyspark.sql import SparkSession

# Create a spark session
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()

# Create data in dataframe
data = [('Ram', '1991-04-01', 'M', 3000),
        ('Mike', '2000-05-19', 'M', 4000),
        ('Rohini', '1978-09-05', 'M', 4000),
        ('Maria', '1967-12-01', 'F', 4000),
        ('Jenis', '1980-02-17', 'F', 1200)]

# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]

# Create the spark dataframe
df1 = spark.createDataFrame(data=data, schema=columns)

# Print the dataframe
df1.show()
Output:
+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+
Creating Second dataframe for demonstration:
Python3
# Create data in dataframe
data2 = [('Mohi', '1991-04-01', 'M', 3000),
         ('Ani', '2000-05-19', 'F', 4300),
         ('Shipta', '1978-09-05', 'F', 4200),
         ('Jessy', '1967-12-01', 'F', 4010),
         ('kanne', '1980-02-17', 'F', 1200)]

# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]

# Create the spark dataframe (from data2, not data)
df2 = spark.createDataFrame(data=data2, schema=columns)

# Print the dataframe
df2.show()
Output:
+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|  Mohi|1991-04-01|     M|  3000|
|   Ani|2000-05-19|     F|  4300|
|Shipta|1978-09-05|     F|  4200|
| Jessy|1967-12-01|     F|  4010|
| kanne|1980-02-17|     F|  1200|
+------+----------+------+------+