How to Implement Inner Join in PySpark DataFrame?

Are you looking for a way to implement an inner join on PySpark dataframes? In this article, we will explore the inner join in PySpark with examples. First, we will create two simple PySpark dataframes with very little data so that the result is easy to follow. After that, we will apply the inner join on them.

Inner join in pyspark dataframe –

Let’s start step by step.

Step 1: Creation of spark dataframe-

Let’s create the first dataframe with the following code.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name is arbitrary.
spark = SparkSession.builder.appName("inner-join-example").getOrCreate()

# Each tuple is one member row; note that year and dept_id are strings.
member_detail = [(1,"Abhishek","2018","10","M",30000),
    (2,"Sukesh","2010","20","M",40000),
    (3,"Avinash","2010","10","M",40000),
    (4,"Saket","2005","10","M",35000),
    (5,"Ankita","2010","40","F",38000)]
member_Columns = ["member_id","name","year", "dept_id","gender","Revenue"]
memberDF = spark.createDataFrame(data=member_detail, schema = member_Columns)
memberDF.show(truncate=False)

Here is the output-

Let’s create the second pyspark dataframe-

section_detail = [("Mens",10),
    ("Household",20), 
    ("Kids",30), 
    ("Electronics",40)]
sectionColumns = ["section_name","section_id"]
sectionDF = spark.createDataFrame(data=section_detail, schema = sectionColumns)
sectionDF.show(truncate=False)

Here is the output for the above piece of the code-

Pyspark second dataframe for inner merge
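
Before joining, it can help to check which key values the two datasets share. The following is a minimal plain-Python sketch (not PySpark) that reuses the key values from the data above. The names member_dept_ids and section_ids are introduced here only for illustration. Note that dept_id is stored as a string in the first dataset while section_id is an integer, so the sketch converts dept_id before comparing.

```python
# Key values copied from the two datasets above.
member_dept_ids = ["10", "20", "10", "10", "40"]  # dept_id column (strings)
section_ids = [10, 20, 30, 40]                    # section_id column (integers)

# Convert dept_id to int so both sides have the same type.
member_keys = {int(d) for d in member_dept_ids}
section_keys = set(section_ids)

# Keys present on both sides, and sections that no member refers to.
common_keys = sorted(member_keys & section_keys)
unmatched_sections = sorted(section_keys - member_keys)

print("common keys:", common_keys)
print("sections without members:", unmatched_sections)
```

Only the keys in common_keys can produce rows in an inner join; any key listed in unmatched_sections will be left out.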

Step 2: Inner Merge –

In this section, we will merge the two dataframes above with an inner join. An inner join keeps only the rows whose join keys match in both dataframes. Here is the code-

memberDF.join(sectionDF,memberDF.dept_id == sectionDF.section_id,"inner").show(truncate=False)

In the code above, we join memberDF and sectionDF on the condition that dept_id is equal to section_id. For example, a member with dept_id 10 matches section_id 10, so the section_name Mens comes along in a new column for that row. If you look closely, there is a section_id 30 for the “Kids” section. It does not match any dept_id in the first dataframe, hence it does not appear in the inner join result.
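
To make the matching rule concrete, here is a small plain-Python sketch of what an inner join computes, using the same rows as above. It illustrates the semantics only and is not how Spark executes the join. The helper inner_join is a hypothetical name, and dept_id is converted to an integer for the comparison because it is stored as a string in the source data.

```python
# Same rows as in the two dataframes above.
member_detail = [(1, "Abhishek", "2018", "10", "M", 30000),
                 (2, "Sukesh", "2010", "20", "M", 40000),
                 (3, "Avinash", "2010", "10", "M", 40000),
                 (4, "Saket", "2005", "10", "M", 35000),
                 (5, "Ankita", "2010", "40", "F", 38000)]
section_detail = [("Mens", 10), ("Household", 20),
                  ("Kids", 30), ("Electronics", 40)]


def inner_join(members, sections):
    """Return each member row extended with its matching section row."""
    # Index sections by section_id (second field) for quick lookup.
    sections_by_id = {}
    for section in sections:
        sections_by_id.setdefault(section[1], []).append(section)

    joined = []
    for member in members:
        dept_id = int(member[3])  # dept_id is a string in the source data
        # Members without a matching section contribute no rows.
        for section in sections_by_id.get(dept_id, []):
            joined.append(member + section)
    return joined


rows = inner_join(member_detail, section_detail)
for row in rows:
    print(row)
```

All five members find a matching section, while the Kids row (section_id 30) never appears in the result because no member has dept_id 30.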

I hope you found this article helpful. If you need more explanation on this topic, please let us know and our team will resolve your doubts on priority. Please subscribe to us for more articles like this on PySpark and data science in simple language.

Thanks

Data Science Learner Team