Pyspark Left Anti Join : How to Perform It, with Examples

Pyspark Left Anti Join

A PySpark left anti join is the opposite of a left join: it returns only those rows from the left dataframe that have no match in the right dataframe. In this article we will walk through it step by step with examples.

PySpark left anti join ( Implementation ) –

The first step is to create two sample PySpark dataframes to explain the concept.

Step 1 : ( Prerequisites ) –

Let's create the first dataframe.

import pyspark
from pyspark.sql import SparkSession

# Create a local Spark session; the code below assumes it is named spark.
spark = SparkSession.builder.appName("left_anti_join_demo").getOrCreate()

records = [(1, "Rahul", "2018", "10", 30000),
    (2, "Syam", "2010", "20", 40000),
    (3, "Mohan", "2010", "10", 40000),
    (4, "Mac", "2005", "60", 35000),
    (5, "Tom", "2010", "40", 38000)]
record_Columns = ["id", "Name", "year", "store_id", "Cost"]
recordDF = spark.createDataFrame(data=records, schema=record_Columns)
recordDF.show(truncate=False)
pyspark dataframe 1

Here we will use store_id for performing the join, so the second dataframe must contain a column to match it against. Let's create the second dataframe.

store_master = [("Sports", 10), ("Books", 20), ("Women", 30), ("Men", 40)]
store_master_columns = ["Category", "Cat_id"]
store_masterDF = spark.createDataFrame(data=store_master, schema=store_master_columns)
store_masterDF.show(truncate=False)
pyspark dataframe 2


Step 2 : Left anti join implementation –

First, let's see the code and its output; after that, I will explain the concept.

recordDF.join(store_masterDF,recordDF.store_id == store_masterDF.Cat_id,"leftanti").show(truncate=False)

Here is the output for the left anti join.

pyspark left anti join implementation


Here we get only one row of the first dataframe, because only store_id 60 does not match any Cat_id in the second dataframe. Every other store_id has a matching Cat_id, so those rows are filtered out. (Note that store_id is a string while Cat_id is an integer; Spark casts the values implicitly when comparing them.) So now you can easily understand what a left anti join is and how it works.
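Conceptually, a left anti join keeps each left row whose key does not appear among the right-side keys. Here is an illustrative plain-Python sketch of that logic (not Spark code; the tuples mirror the sample data above):

```python
# Plain-Python sketch of left anti join semantics (illustration only, not Spark).
records = [(1, "Rahul", "2018", "10", 30000),
           (2, "Syam", "2010", "20", 40000),
           (3, "Mohan", "2010", "10", 40000),
           (4, "Mac", "2005", "60", 35000),
           (5, "Tom", "2010", "40", 38000)]
store_master = [("Sports", 10), ("Books", 20), ("Women", 30), ("Men", 40)]

# Collect the right-side keys (Cat_id), as strings so they compare with store_id.
right_keys = {str(cat_id) for _, cat_id in store_master}

# Keep only the left rows whose store_id has no match on the right.
anti_join = [row for row in records if row[3] not in right_keys]
print(anti_join)  # only Mac's row, since store_id 60 has no matching Cat_id
```

Only Mac's row survives, which matches the Spark output above.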

Difference between left join and left anti join –

I recommend also looking at the implementation of the left join and its output. Comparing the two makes the difference very easy to understand.

recordDF.join(store_masterDF,recordDF.store_id == store_masterDF.Cat_id,"left").show(truncate=False)

All we need to do is replace "leftanti" with "left" here. Now let's see the output.

left join pyspark

From the output you can observe that when a store_id does not match any Cat_id, the corresponding right-side columns are filled with null. In the left anti join, by contrast, you get only those unmatched rows from the left dataframe, with no columns from the right.
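To make the contrast concrete, here is an illustrative plain-Python sketch (again, not Spark code) of left join semantics, where unmatched left rows are padded with None in place of the right-side columns:

```python
# Plain-Python sketch of left join semantics (illustration only, not Spark).
records = [(4, "Mac", "2005", "60", 35000),
           (5, "Tom", "2010", "40", 38000)]
store_master = [("Sports", 10), ("Books", 20), ("Women", 30), ("Men", 40)]

left_join = []
for row in records:
    matches = [s for s in store_master if str(s[1]) == row[3]]
    if matches:
        for s in matches:
            left_join.append(row + s)          # matched: append right columns
    else:
        left_join.append(row + (None, None))   # unmatched: pad with nulls

print(left_join)
# Mac's row is padded with (None, None); Tom's row gains ("Men", 40)
```

A left anti join would instead return only Mac's original row, with no right-side columns at all.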

I hope this article on PySpark was helpful and informative. Please subscribe for more articles on PySpark and Data Science.

Must-Read Articles :

How to Implement Inner Join in pyspark Dataframe ?

Pyspark Join two dataframes : Step By Step Tutorial

Pyspark union Concept and Approach : With Code


Thanks

Data Science Learner Team


Meet Abhishek (Chief Editor), a data scientist with expertise in NLP and Text Analytics. He has worked on various projects involving text data and has achieved great results. He currently manages Datasciencelearner.com, where he and his team share knowledge and help others learn more about data science.