We can subtract one Pyspark dataframe from another using the exceptAll() and subtract() functions. Another alternative is a left_anti join. In this article, we will create dummy pyspark dataframes and explore all three techniques.
Dummy dataframe creation for exploration –
Use the below code to create the dummy dataframes; it will help in exploring the solutions below.
Pyspark Dataframe Creation 1 :
Here we will create a pyspark dataframe with only five sample rows.
import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a spark session
spark = SparkSession.builder.appName('data science learner.com').getOrCreate()

# Five sample rows: (seq, Name, joining_year, specialization_id, salary)
records = [(1,"Mac","2018","10",30000),
           (2,"Piyu","2010","20",40000),
           (3,"Jack","2010","10",40000),
           (4,"Charlee","2005","60",35000),
           (5,"Guo","2010","40",38000)]
record_Columns = ["seq","Name","joining_year","specialization_id","salary"]
sampleDF1 = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF1.show(truncate=False)
Let’s see the dataframe in tabular format.
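With truncate=False, the show() output should look like this (reconstructed here from the sample data above, so the exact spacing may differ slightly):
+---+-------+------------+-----------------+------+
|seq|Name   |joining_year|specialization_id|salary|
+---+-------+------------+-----------------+------+
|1  |Mac    |2018        |10               |30000 |
|2  |Piyu   |2010        |20               |40000 |
|3  |Jack   |2010        |10               |40000 |
|4  |Charlee|2005        |60               |35000 |
|5  |Guo    |2010        |40               |38000 |
+---+-------+------------+-----------------+------+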

Pyspark Dataframe Creation 2 :
In this dummy dataframe we will keep only three rows, each a duplicate of a row from the above dataframe.
import pyspark
from pyspark.sql import SparkSession

# getOrCreate() returns the already-running session from above
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Three rows, all duplicates of rows in sampleDF1
records = [(1,"Mac","2018","10",30000),
           (4,"Charlee","2005","60",35000),
           (5,"Guo","2010","40",38000)]
record_Columns = ["seq","Name","joining_year","specialization_id","salary"]
sampleDF2 = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF2.show(truncate=False)
Here is the dataframe in tabular format.
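Again reconstructed from the sample data, show(truncate=False) should print something close to:
+---+-------+------------+-----------------+------+
|seq|Name   |joining_year|specialization_id|salary|
+---+-------+------------+-----------------+------+
|1  |Mac    |2018        |10               |30000 |
|4  |Charlee|2005        |60               |35000 |
|5  |Guo    |2010        |40               |38000 |
+---+-------+------------+-----------------+------+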
Pyspark subtract dataset ( Solution ) –
Solution 1 : Using subtract() function –
The first solution uses the subtract() function. Here is the complete syntax:
final = sampleDF1.subtract(sampleDF2)
final.show(truncate=False)
Since dataframe 1 and dataframe 2 have three rows in common, running this piece of code returns only the remaining two rows. Let’s run it and see the output.
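The expected output contains only the Piyu and Jack rows (row order may vary, since subtract() involves a shuffle):
+---+----+------------+-----------------+------+
|seq|Name|joining_year|specialization_id|salary|
+---+----+------------+-----------------+------+
|2  |Piyu|2010        |20               |40000 |
|3  |Jack|2010        |10               |40000 |
+---+----+------------+-----------------+------+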

Solution 2 : Using exceptAll() function –
Similarly, we can use the exceptAll() function to subtract two pyspark dataframes. Let’s see the below code for reference.
final = sampleDF1.exceptAll(sampleDF2)
final.show(truncate=False)
We can run this code and check the output in the same way. It should be identical to the above, returning the same two non-common rows. The underlying difference is that subtract() deduplicates its result (like SQL EXCEPT DISTINCT), while exceptAll() preserves duplicate rows (like SQL EXCEPT ALL); since our sample data has no duplicates, both return the same result here.
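To make that difference concrete, here is a minimal sketch with two small hypothetical dataframes (dupDF and otherDF are illustrative names, not part of the example above):
# A dataframe where the row (1, "Mac") appears twice
dupDF = spark.createDataFrame([(1,"Mac"),(1,"Mac"),(2,"Piyu")], ["seq","Name"])
otherDF = spark.createDataFrame([(1,"Mac")], ["seq","Name"])
# subtract() drops every copy of a matching row, so only (2, Piyu) remains
dupDF.subtract(otherDF).show()
# exceptAll() removes one copy per match, so (1, Mac) survives once along with (2, Piyu)
dupDF.exceptAll(otherDF).show()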

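Solution 3 : Using left_anti join –
As mentioned at the start, a left_anti join is one more alternative. Here is a minimal sketch, assuming the same sampleDF1, sampleDF2, and record_Columns as above; joining on every column keeps only the rows of sampleDF1 that have no match in sampleDF2.
# Anti join on all columns behaves like subtract(), returning the two non-common rows
final = sampleDF1.join(sampleDF2, on=record_Columns, how="left_anti")
final.show(truncate=False)
Note that, unlike subtract() and exceptAll(), the join needs its matching columns spelled out explicitly.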
Must-Read Articles :
1. How do you find spark dataframe shape pyspark ( With Code ) ?
2. Pyspark withColumn : Syntax with Example
3. Pyspark add new row to dataframe : With Syntax and Example
Subtraction is a very important operation in many big data assignments. If we do not use the right technique, dataset subtraction can become computation-heavy. Both subtract() and exceptAll() are performance-optimized in pyspark and take very little time.
Thanks
Data Science Learner Team