We can join two PySpark dataframes with the join() function. In the join function, we need to pass the dataframe to join with, the field(s) on which we are performing the join, and the type of join we want. Hey, trust me, it's very easy!! To make it simpler, I have broken it down step by step.
The first step is to create two PySpark dataframes to understand how the join works. It is a prerequisite for the second (main) step.
Please run the below code to create the first dataframe –
import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; the "spark" variable is used below
spark = SparkSession.builder.appName("JoinDemo").getOrCreate()
records = [(1,"Mac","2018","10",30000),
(2,"Piyu","2010","20",40000),
(3,"Jack","2010","10",40000),
(4,"Charlee","2005","60",35000),
(5,"Guo","2010","40",38000)]
record_Columns = ["seq","Name","joining_year", "specialization_id","salary"]
sampleDF = spark.createDataFrame(data=records, schema = record_Columns)
sampleDF.show(truncate=False)
It will create sampleDF (a PySpark dataframe). Here is the code for the second PySpark dataframe.
store_master = [("Sports",10),("Books",20), ("Women",30), ("Men",40)]
store_master_columns = ["Catagory","Cat_id"]
store_masterDF = spark.createDataFrame(data=store_master, schema = store_master_columns)
store_masterDF.show(truncate=False)
Here is the general syntax for a PySpark join.
DF_first.join(DF_second, DF_first.field_1 == DF_second.field_2, "merge_type")
Here the merge type can be "inner", "outer", "left", "right", "full", etc. (if omitted, it defaults to "inner"). Let's execute the code.
Here is the code for left join.
sampleDF.join(store_masterDF,sampleDF.specialization_id == store_masterDF.Cat_id,"left").show(truncate=False)
In a left join, all rows from the left dataframe are kept; where a match is found, the corresponding values from the right dataframe are attached, and where there is no match the right-side columns are null. Refer to the below output.
Here is the code for the right join.
sampleDF.join(store_masterDF,sampleDF.specialization_id == store_masterDF.Cat_id,"right").show(truncate=False)
Here is the output for this.
In a right join, all rows from the right dataframe are kept, and the left-side columns are null wherever there is no match.
In a full (outer) join, rows from both dataframes are kept: if a left row has no match on the right, the right-side columns are null, and vice versa. Here is the code for that –
sampleDF.join(store_masterDF,sampleDF.specialization_id == store_masterDF.Cat_id,"full").show(truncate=False)
I hope you found this article helpful for understanding PySpark joins. Please subscribe to us for similar articles on PySpark and data science.
Thanks
Data Science Learner Team