
PySpark Left Anti Join: How to Perform It, with Examples

A PySpark left anti join is the opposite of the matching part of a left join: it returns only those rows of the left dataframe that have no match in the right dataframe, and it keeps only the left dataframe's columns. In this article we will work through it step by step with examples.

pyspark left anti join ( Implementation ) –

The first step is to create two sample PySpark dataframes to explain the concept.

Step 1 : ( Prerequisites ) –

Let's create the first dataframe.

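from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; the app name is arbitrary.
spark = SparkSession.builder.appName("left_anti_join_example").getOrCreate()

# Sample records. Note that store_id (4th field) is stored as a string.
records = [(1, "Rahul", "2018", "10", 30000),
    (2, "Syam", "2010", "20", 40000),
    (3, "Mohan", "2010", "10", 40000),
    (4, "Mac", "2005", "60", 35000),
    (5, "Tom", "2010", "40", 38000)]
record_Columns = ["id", "Name", "year", "store_id", "Cost"]
recordDF = spark.createDataFrame(data=records, schema=record_Columns)
recordDF.show(truncate=False)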

[Image: output of recordDF.show(), the first dataframe]

We will join on store_id, so the second dataframe needs a column that holds the same values; here that column is called Cat_id. Let's create the second dataframe.

store_master = [("Sports",10),("Books",20), ("Women",30), ("Men",40)]
store_master_columns = ["Catagory","Cat_id"]
store_masterDF = spark.createDataFrame(data=store_master, schema = store_master_columns)
store_masterDF.show(truncate=False)

[Image: output of store_masterDF.show(), the second dataframe]

Step 2: Left anti join implementation –

First, let's look at the code and its output. After that, I will explain the concept.

recordDF.join(store_masterDF,recordDF.store_id == store_masterDF.Cat_id,"leftanti").show(truncate=False)

Here is the output of the left anti join.

[Image: output of the left anti join]

Here we get only one row of the first dataframe, because only one “store_id” value ( 60 ) does not match any “Cat_id” in the second dataframe. Every other “store_id” value ( 10, 20 and 40 ) has a matching “Cat_id”, so those rows are dropped. Note that the result contains only the columns of the first dataframe. That is all a left anti join does: it keeps the left rows that found no match.
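To make the rule concrete, here is a minimal plain-Python sketch of the same logic, without Spark. The helper name left_anti_join and its key-function parameters are made up for this illustration and are not part of PySpark. Because store_id is stored as a string and Cat_id as an integer in the sample data, the sketch converts store_id to an integer before comparing.

```python
# Plain-Python sketch of left anti join semantics (no Spark required).
# left_anti_join is a hypothetical helper written only for this illustration.
records = [
    (1, "Rahul", "2018", "10", 30000),
    (2, "Syam", "2010", "20", 40000),
    (3, "Mohan", "2010", "10", 40000),
    (4, "Mac", "2005", "60", 35000),
    (5, "Tom", "2010", "40", 38000),
]
store_master = [("Sports", 10), ("Books", 20), ("Women", 30), ("Men", 40)]


def left_anti_join(left_rows, right_rows, left_key, right_key):
    """Return the left rows whose key has no match among the right rows."""
    right_keys = {right_key(row) for row in right_rows}
    return [row for row in left_rows if left_key(row) not in right_keys]


unmatched = left_anti_join(
    records,
    store_master,
    left_key=lambda row: int(row[3]),  # store_id is stored as a string
    right_key=lambda row: row[1],      # Cat_id is stored as an integer
)
print(unmatched)  # [(4, 'Mac', '2005', '60', 35000)]
```

Only the row with store_id 60 survives, and no column from store_master appears in the result, which mirrors what the PySpark join returns.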

Difference between left join and left anti join –

To see the difference, it helps to run the same join as a plain left join and compare the two outputs.

recordDF.join(store_masterDF,recordDF.store_id ==  store_masterDF.Cat_id,"left").show(truncate=False)

All we need to do is replace “leftanti” with “left” here. Now let’s see the output.

[Image: output of the left join]

As the output shows, when a store_id has no matching Cat_id, the left join still keeps the row and fills the columns from the second dataframe with null. The left anti join, in contrast, returns only the unmatched rows, and only with the columns of the left dataframe.
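The relationship between the two joins can also be sketched in plain Python: a left anti join gives the same rows as a left join that is filtered down to the rows where the right side is empty, keeping only the left columns. This is an illustration of the semantics under that assumption, not a description of how Spark executes the join. The helper left_join is hypothetical, and None stands in for Spark's null.

```python
# Plain-Python sketch: left join vs. left anti join (no Spark required).
# left_join is a hypothetical helper written only for this illustration.
records = [
    (1, "Rahul", "2018", "10", 30000),
    (2, "Syam", "2010", "20", 40000),
    (3, "Mohan", "2010", "10", 40000),
    (4, "Mac", "2005", "60", 35000),
    (5, "Tom", "2010", "40", 38000),
]
store_master = [("Sports", 10), ("Books", 20), ("Women", 30), ("Men", 40)]


def left_join(left_rows, right_rows, left_key, right_key):
    """Keep every left row; pad with None when no right row matches."""
    right_width = len(right_rows[0])
    joined = []
    for left in left_rows:
        matches = [r for r in right_rows if right_key(r) == left_key(left)]
        if matches:
            joined.extend(left + right for right in matches)
        else:
            joined.append(left + (None,) * right_width)
    return joined


joined = left_join(
    records,
    store_master,
    left_key=lambda row: int(row[3]),  # store_id is stored as a string
    right_key=lambda row: row[1],      # Cat_id is stored as an integer
)

# Left anti join: unmatched rows only, restricted to the left columns.
left_width = len(records[0])
anti = [row[:left_width] for row in joined if row[-1] is None]

for row in joined:
    print(row)
print(anti)
```

The left join keeps all five rows and pads the row for store_id 60 with None, while the derived anti join keeps just that one row without the padded columns.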

I hope this article on PySpark is helpful and informative for you.
