Pyspark Subtract Dataset : Step by Step Approach

We can subtract one pyspark dataset from another using the subtract() and exceptAll() functions. A left_anti join is another alternative. In this article, we will create two dummy pyspark dataframes and explore all three techniques.

Dummy dataframe creation for exploration –

Use the below code to create the dummy dataframes; they will help in exploring the solutions that follow.

Pyspark Dataframe Creation 1 :

Here we will create a pyspark dataframe with five sample rows.

import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName('data science learner.com').getOrCreate()

# Five sample rows: (seq, Name, joining_year, specialization_id, salary)
records = [(1, "Mac", "2018", "10", 30000),
           (2, "Piyu", "2010", "20", 40000),
           (3, "Jack", "2010", "10", 40000),
           (4, "Charlee", "2005", "60", 35000),
           (5, "Guo", "2010", "40", 38000)]
record_Columns = ["seq", "Name", "joining_year", "specialization_id", "salary"]
sampleDF1 = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF1.show(truncate=False)

Let’s see the dataframe in tabular format.

pyspark dataframe 1

Pyspark Dataframe Creation 2 :

This dummy dataframe will contain only three rows, each duplicated from the above dataframe.

import pyspark
from pyspark.sql import SparkSession

# getOrCreate() returns the SparkSession we already built above
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Three rows copied verbatim from sampleDF1
records = [(1, "Mac", "2018", "10", 30000),
           (4, "Charlee", "2005", "60", 35000),
           (5, "Guo", "2010", "40", 38000)]
record_Columns = ["seq", "Name", "joining_year", "specialization_id", "salary"]
sampleDF2 = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF2.show(truncate=False)

Here is the dataframe in tabular format.

pyspark dataframe 2

Pyspark subtract dataset ( Solutions ) –

Solution 1 : Using subtract() function –

The first solution uses the subtract() function. Here is the complete syntax:

final = sampleDF1.subtract(sampleDF2)
final.show(truncate=False)

Since dataframe 1 and dataframe 2 have three rows in common, running this piece of code returns only the two remaining rows. Let's run it and see the output.

Pyspark subtract dataset

Solution 2 : Using exceptAll() function –

Similarly, we can use the exceptAll() function to subtract one pyspark dataset from another. Let's see the below code for reference.

final = sampleDF1.exceptAll(sampleDF2)
final.show(truncate=False)

Running this code produces output identical to the above: it also returns the two rows that are not common. That is because sampleDF1 contains no duplicate rows. In general, subtract() behaves like SQL's EXCEPT DISTINCT and removes duplicates, while exceptAll() behaves like EXCEPT ALL and preserves them; the sketch after the output below illustrates the difference.

Pyspark except_all dataset
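
To see exactly where the two functions differ, here is a minimal sketch using a dataframe that contains a duplicated row. The dataframes dupDF and other and their rows are hypothetical, made up only for this illustration.

# Hypothetical data: the Mac row appears twice in dupDF
dup_records = [(1, "Mac", "2018", "10", 30000),
               (1, "Mac", "2018", "10", 30000),
               (2, "Piyu", "2010", "20", 40000)]
dupDF = spark.createDataFrame(data=dup_records, schema=record_Columns)
other = spark.createDataFrame(data=[(1, "Mac", "2018", "10", 30000)],
                              schema=record_Columns)

# subtract() deduplicates: both Mac rows disappear, leaving only the Piyu row
dupDF.subtract(other).show(truncate=False)

# exceptAll() removes one occurrence per matching row in other,
# so one Mac row survives alongside the Piyu row
dupDF.exceptAll(other).show(truncate=False)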

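Solution 3 : Using left_anti join –

As mentioned at the start, a left_anti join is one more alternative. Here is a sketch that joins on all columns so that the result mimics the subtraction above. Note that, like exceptAll(), a left_anti join keeps duplicate rows from the left side, and that join conditions never match null values, so rows containing nulls behave differently here than with subtract().

final = sampleDF1.join(sampleDF2, on=record_Columns, how="left_anti")
final.show(truncate=False)
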
Must-Read Articles :

1. How do you find spark dataframe shape pyspark ( With Code ) ?

2. Pyspark withColumn : Syntax with Example

3. Pyspark add new row to dataframe : With Syntax and Example

Subtraction is an important operation in many big data assignments. If we do not use the right technique, dataset subtraction can become computation-heavy. Both subtract() and exceptAll() run as distributed pyspark operations, so they remain efficient even on large dataframes. For contrast, the sketch below shows the kind of approach to avoid.
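
This is an illustrative anti-pattern, not a recommended implementation: it collects both dataframes to the driver and filters row by row in plain Python, which loses all of Spark's parallelism.

# Anti-pattern (illustrative only): every row is pulled to the driver,
# and the comparison runs in single-threaded Python
rows_to_remove = set(tuple(row) for row in sampleDF2.collect())
slow_result = [tuple(row) for row in sampleDF1.collect()
               if tuple(row) not in rows_to_remove]
print(slow_result)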

Thanks

Data Science Learner Team

