Implement Inner Join in pyspark Dataframe

How to Implement Inner Join in pyspark Dataframe ?

GET FREE AMZAON AUDIOBOOKS

Are you looking for inner join in pyspark dataframe implementation? In this article, we will explore inner join in pyspark with examples. Firstly we will create two simple pyspark dataframe with very minimal data to understand it better. After it, we will apply the inner join on the top of it.

Inner join in pyspark dataframe –

Let’s start step by step.

Step 1: Creation of spark dataframe-

Let’s create the first dataframe with the following code.

import pyspark
from pyspark.sql import SparkSession
member_detail = [(1,"Abhishek","2018","10","M",30000), 
    (2,"Sukesh","2010","20","M",40000), 
    (3,"Avinash","2010","10","M",40000), 
    (4,"Saket","2005","10","M",35000), 
    (5,"Ankita","2010","40","F",38000)]
member_Columns = ["member_id","name","year", "dept_id","gender","Revenue"]
memberDF = spark.createDataFrame(data=member_detail, schema = member_Columns)
memberDF.show(truncate=False)

Here is the output-

pyspark dataFrame creation

Let’s create the second pyspark dataframe-

section_detail = [("Mens",10),
    ("Household",20), 
    ("Kids",30), 
    ("Electronics",40)]
sectionColumns = ["section_name","section_id"]
sectionDF = spark.createDataFrame(data=section_detail, schema = sectionColumns)
sectionDF.show(truncate=False)

Here is the output for the above piece of the code-

Pyspark second dataframe for inner merge
Pyspark second dataframe for inner merge

Step 2: Inner Merge –

In this section, we will merge the above two dataframe with inner join.  Inner join selects the common data points from both dataframe. Here is the code-

memberDF.join(sectionDF,memberDF.dept_id == sectionDF.section_id,"inner").show(truncate=False)
inner join in pyspark dataframe
inner join in pyspark dataframe

 

If you look at the above output, We have to merge dataframe memberDF and sectionDF on dept_id and section_id where the section_id is equal to dept_id.  For example, the dept_id is 1o which is equal to the section_id 10. This contains section_name as Male which is coming along in a new column. If you look closely there is section_id 30 for the “kids” section. This is not matching with any of the dept_id in the first dataframe. Hence it is not populating in inner join dataframe.

I hope you found this article helpful, still, if you need more explanation on this topic please let us know. Our team will resolve your doubt on priority. Please subscribe to us for more similar articles on Pyspark and data science in simple language.

 

Thanks 

Data Science Learner Team

Join our list

Subscribe to our mailing list and get interesting stuff and updates to your email inbox.

Thank you for signup. A Confirmation Email has been sent to your Email Address.

Something went wrong.

 
Thank you For sharing.We appreciate your support. Don't Forget to LIKE and FOLLOW our SITE to keep UPDATED with Data Science Learner