We can join two PySpark dataframes with the join() function. In the join function, we need to pass the dataframe to join with, the field(s) on which we are performing the join, and the type of join we want. Hey, trust me, it's very easy!! To make it simpler, I have broken it down step by step.
The first step is to create two PySpark dataframes to understand how the join works. It is a prerequisite for the second (main) step.
Please run the below code to create the first dataframe –
import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; the "spark" variable is used below
spark = SparkSession.builder.appName("JoinDemo").getOrCreate()
records = [(1,"Mac","2018","10",30000),
(2,"Piyu","2010","20",40000),
(3,"Jack","2010","10",40000),
(4,"Charlee","2005","60",35000),
(5,"Guo","2010","40",38000)]
record_Columns = ["seq","Name","joining_year", "specialization_id","salary"]
sampleDF = spark.createDataFrame(data=records, schema = record_Columns)
sampleDF.show(truncate=False)
It will create sampleDF (a PySpark dataframe). Here is the code for the second PySpark dataframe.
store_master = [("Sports",10),("Books",20), ("Women",30), ("Men",40)]
store_master_columns = ["Catagory","Cat_id"]
store_masterDF = spark.createDataFrame(data=store_master, schema = store_master_columns)
store_masterDF.show(truncate=False)
Here is the general syntax for a PySpark join.
DF_first.join(DF_second, DF_first.field_1 == DF_second.field_2, "merge_type")
Here the merge type can be "inner", "outer", "left", "right", "full", etc. (if omitted, it defaults to "inner"). Let's execute the code.
Here is the code for left join.
sampleDF.join(store_masterDF,sampleDF.specialization_id == store_masterDF.Cat_id,"left").show(truncate=False)
In a left join, all rows from the left dataframe are kept; where a match is found, the corresponding values from the right dataframe are attached, and where there is no match the right-side columns are null. Refer to the below output.
Here is the code for the right join.
sampleDF.join(store_masterDF,sampleDF.specialization_id == store_masterDF.Cat_id,"right").show(truncate=False)
Here is the output for this.
In a right join, all rows from the right dataframe are kept, and the left-side columns are null wherever there is no match.
In a full (outer) join, rows from both dataframes are kept: if a left row has no match on the right, the right-side columns are null, and vice versa. Here is the code for that –
sampleDF.join(store_masterDF,sampleDF.specialization_id == store_masterDF.Cat_id,"full").show(truncate=False)
I hope you found this article helpful for understanding PySpark joins. Please subscribe to us for similar articles on PySpark and data science.
Thanks
Data Science Learner Team