A PySpark left anti join is the complement of the matched part of a left join: it returns only those records from the left DataFrame that have no match in the right DataFrame. In this article we will understand it with examples, step by step.
The first step is to create two sample PySpark DataFrames to explain the concept.
Let’s create the first DataFrame.
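import pyspark
from pyspark.sql import SparkSession
# Create (or reuse) a SparkSession; the app name is arbitrary
spark = SparkSession.builder.appName("left_anti_join_example").getOrCreate()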
records = [(1,"Rahul","2018","10",30000),
(2,"Syam","2010","20",40000),
(3,"Mohan","2010","10",40000),
(4,"Mac","2005","60",35000),
(5,"Tom","2010","40",38000)]
record_Columns = ["id","Name","year", "store_id","Cost"]
recordDF = spark.createDataFrame(data=records, schema = record_Columns)
recordDF.show(truncate=False)
We will join on store_id, so the second DataFrame needs a column holding the matching store values; here that column is called Cat_id. Let’s create the second DataFrame.
store_master = [("Sports",10),("Books",20), ("Women",30), ("Men",40)]
store_master_columns = ["Catagory","Cat_id"]
store_masterDF = spark.createDataFrame(data=store_master, schema = store_master_columns)
store_masterDF.show(truncate=False)
First, let’s look at the code and its output. After that, I will explain the concept.
recordDF.join(store_masterDF,recordDF.store_id == store_masterDF.Cat_id,"leftanti").show(truncate=False)
Here is the output for the left anti join.
We get only one row of the first DataFrame (id 4, Mac), because its “store_id” (60) is the only one that does not match any “Cat_id” in the second DataFrame. Every other “store_id” (10, 20, 40) has a matching “Cat_id”, so those rows are dropped. Note also that the result contains only the columns of the left DataFrame. That is what a left anti join is and how it works.
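To make the rule concrete, here is a minimal plain-Python sketch of the same logic, with no Spark involved. The helper name left_anti_join is hypothetical and only for illustration. Plain Python does not treat "10" and 10 as equal, so the sketch converts store_id to an integer explicitly before comparing.

```python
# Plain-Python illustration of left anti join semantics (not Spark code).
records = [
    (1, "Rahul", "2018", "10", 30000),
    (2, "Syam", "2010", "20", 40000),
    (3, "Mohan", "2010", "10", 40000),
    (4, "Mac", "2005", "60", 35000),
    (5, "Tom", "2010", "40", 38000),
]
store_master = [("Sports", 10), ("Books", 20), ("Women", 30), ("Men", 40)]


def left_anti_join(left_rows, right_rows):
    """Keep the left rows whose store_id has no matching Cat_id on the right."""
    # Collect every Cat_id present in the right-hand data
    right_keys = {cat_id for _category, cat_id in right_rows}
    # store_id (index 3) is a string, so convert it before the membership test
    return [row for row in left_rows if int(row[3]) not in right_keys]


anti_rows = left_anti_join(records, store_master)
print(anti_rows)  # [(4, 'Mac', '2005', '60', 35000)]
```

Only the row with store_id 60 survives, and it carries just the left-hand columns, which mirrors the result described above.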
I recommend looking again at the implementation of a left join and its output. Comparing the two makes the difference easy to understand.
recordDF.join(store_masterDF,recordDF.store_id == store_masterDF.Cat_id,"left").show(truncate=False)
All we need to do is replace “leftanti” with “left” here. Now let’s see the output.
As you can see from the output, when a store_id does not match any Cat_id, the columns coming from the right DataFrame are null for that row. In the left anti join, by contrast, you get only those unmatched rows, and only with the columns of the left DataFrame.
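The relationship between the two joins can be sketched in plain Python as well: a left join keeps every left row and pads unmatched ones with None, and keeping only the padded rows (minus the right-hand columns) gives the left anti join result. The helper name left_join is hypothetical, and this is an illustration of the semantics under those assumptions, not Spark's implementation. In PySpark the analogous pattern would be a left join followed by a filter on Cat_id being null.

```python
# Plain-Python illustration of left join vs. left anti join (not Spark code).
records = [
    (1, "Rahul", "2018", "10", 30000),
    (2, "Syam", "2010", "20", 40000),
    (3, "Mohan", "2010", "10", 40000),
    (4, "Mac", "2005", "60", 35000),
    (5, "Tom", "2010", "40", 38000),
]
store_master = [("Sports", 10), ("Books", 20), ("Women", 30), ("Men", 40)]


def left_join(left_rows, right_rows):
    """Keep every left row; unmatched rows get None for the right columns."""
    joined = []
    for row in left_rows:
        # store_id (index 3) is a string, so convert it before comparing
        matches = [r for r in right_rows if r[1] == int(row[3])]
        if matches:
            joined.extend(row + match for match in matches)
        else:
            joined.append(row + (None, None))
    return joined


left_joined = left_join(records, store_master)

# Rows whose right-hand Cat_id (index 6) is None are exactly the anti join rows;
# slicing to the first five fields drops the right-hand columns.
unmatched = [row[:5] for row in left_joined if row[6] is None]

print(len(left_joined))  # 5
print(unmatched)         # [(4, 'Mac', '2005', '60', 35000)]
```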
I hope this article on PySpark is helpful and informative for you. Please subscribe for more articles like this on PySpark and data science.
Thanks
Data Science Learner Team