Pyspark Join two dataframes : Step By Step Tutorial

Pyspark Join two dataframes

We can perform Pyspark join two dataframes with join() function. Here In the join function, we need to pass the name of dataframes for joining, On which field we are performing the join and what is the type of join we want. Hey, trust me it’s very easy !! For making it more simpler, I thought to make it step by step-

Pyspark join two dataframes – ( Steps )-

The first step is to create two Pyspark dataframe to understand how the join works. It is more like prerequisites for the second (main ) step.

Step 1: Pyspark sample dataframe creation –

Please run the below code for first dataframe creation –

import pyspark
from pyspark.sql import SparkSession
records = [(1,"Mac","2018","10",30000), 
record_Columns = ["seq","Name","joining_year", "specialization_id","salary"]
sampleDF = spark.createDataFrame(data=records, schema = record_Columns)


Pyspark dataframe_1
Pyspark dataframe_1

It will create the sampleDF ( Pyspark dataframe ). Here is the code for the second pyspark dataframe.

store_master = [("Sports",10),("Books",20), ("Women",30), ("Men",40)]
store_master_columns = ["Catagory","Cat_id"]
store_masterDF = spark.createDataFrame(data=store_master, schema = store_master_columns)
pyspark dataframe_2
pyspark dataframe_2

Step 2: Joining Pyspark dataframes-

Here is the syntax for pyspark join.

DF_first.join(DF_second,DF_first.field_1==  DF_second.field_2"merge_type")

Here the merge type can be “inner”, “outer”, “left”, “full”. Lets see execute the code.

2.1 Pyspark Dataframe left join –

Here is the code for left join.

sampleDF.join(store_masterDF,sampleDF.specialization_id ==  store_masterDF.Cat_id,"left").show(truncate=False)

Left join works in the way where all values from the left side dataframe will come and along with it the matching value comes from the Right dataframe but non-matching value will be null. Refer to the below output.

Pyspark join two dataframes left
Pyspark join two dataframes left

2.2 Pyspark Dataframe right join –

Here is the syntax for the Right join dataframe.

sampleDF.join(store_masterDF,sampleDF.specialization_id == store_masterDF.Cat_id,"right").show(truncate=False)

Here is the output for this.

Pyspark right join
Pyspark right join

In the right join, the matching values from the right will come and the rest will be null.

2.3  Pyspark Dataframe full join –

In full join, if the left dataframe is not matching with right, It will be null and vice versa. Here is the code for that-

sampleDF.join(store_masterDF,sampleDF.specialization_id ==  store_masterDF.Cat_id,"full").show(truncate=False)


Pyspark full join
Pyspark full join

I hope you must have found this article helpful for understanding the pyspark joins. Please subscribe to us for similar articles on pyspark and data science.


Data Science Learner Team

Join our list

Subscribe to our mailing list and get interesting stuff and updates to your email inbox.

Thank you for signup. A Confirmation Email has been sent to your Email Address.

Something went wrong.

Meet Abhishek ( Chief Editor) , a data scientist with major expertise in NLP and Text Analytics. He has worked on various projects involving text data and have been able to achieve great results. He is currently manages, where he and his team share knowledge and help others learn more about data science.
Thank you For sharing.We appreciate your support. Don't Forget to LIKE and FOLLOW our SITE to keep UPDATED with Data Science Learner