
Pyspark Subtract Dataset: Step by Step Approach

We can implement pyspark subtract dataset using the exceptAll() and subtract() functions. Another alternative is a left_anti join. In this article, we will create dummy pyspark dataframes and explore all three techniques.

Creating dummy dataframes for exploration –

Use the code below to create the dummy dataframes; they will be used throughout the solutions that follow.

Pyspark Dataframe Creation 1:

Here we will create a pyspark dataframe with five sample rows.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName('data science learner.com').getOrCreate()

# Five sample employee records
records = [(1, "Mac", "2018", "10", 30000),
           (2, "Piyu", "2010", "20", 40000),
           (3, "Jack", "2010", "10", 40000),
           (4, "Charlee", "2005", "60", 35000),
           (5, "Guo", "2010", "40", 38000)]
record_Columns = ["seq", "Name", "joining_year", "specialization_id", "salary"]
sampleDF1 = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF1.show(truncate=False)

Let’s see the dataframe in tabular format.

+---+-------+------------+-----------------+------+
|seq|Name   |joining_year|specialization_id|salary|
+---+-------+------------+-----------------+------+
|1  |Mac    |2018        |10               |30000 |
|2  |Piyu   |2010        |20               |40000 |
|3  |Jack   |2010        |10               |40000 |
|4  |Charlee|2005        |60               |35000 |
|5  |Guo    |2010        |40               |38000 |
+---+-------+------------+-----------------+------+

Pyspark Dataframe Creation 2:

In this dummy dataframe, we will keep only three rows duplicated from the above dataframe.

# Reuse the SparkSession and column list defined above;
# keep three rows that also appear in sampleDF1
records = [(1, "Mac", "2018", "10", 30000),
           (4, "Charlee", "2005", "60", 35000),
           (5, "Guo", "2010", "40", 38000)]
sampleDF2 = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF2.show(truncate=False)

Here is the dataframe in tabular format.
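+---+-------+------------+-----------------+------+
|seq|Name   |joining_year|specialization_id|salary|
+---+-------+------------+-----------------+------+
|1  |Mac    |2018        |10               |30000 |
|4  |Charlee|2005        |60               |35000 |
|5  |Guo    |2010        |40               |38000 |
+---+-------+------------+-----------------+------+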

Pyspark Subtract Dataset (Solutions) –

Solution 1: Using the subtract() function –

The first solution uses the subtract() function, which returns the rows of one dataframe that do not appear in the other, as a distinct set. Here is the complete syntax:

final = sampleDF1.subtract(sampleDF2)
final.show(truncate=False)

Since dataframe 1 and dataframe 2 have three rows in common, running this piece of code returns only the two remaining rows (row order may vary). Let's run it and see the output.

+---+----+------------+-----------------+------+
|seq|Name|joining_year|specialization_id|salary|
+---+----+------------+-----------------+------+
|2  |Piyu|2010        |20               |40000 |
|3  |Jack|2010        |10               |40000 |
+---+----+------------+-----------------+------+

Solution 2: Using the exceptAll() function –

Similarly, we can use the exceptAll() function to subtract one pyspark dataframe from another. See the code below for reference.

final=sampleDF1.exceptAll(sampleDF2)
final.show(truncate=False)

In the same way, we can run and check the output. Here it is identical to the above, returning the same two non-common rows, because our dataframes contain no duplicate rows. The difference is that subtract() works on distinct rows (like SQL EXCEPT DISTINCT), while exceptAll() preserves duplicates (like SQL EXCEPT ALL); see the sketch after the output below.

+---+----+------------+-----------------+------+
|seq|Name|joining_year|specialization_id|salary|
+---+----+------------+-----------------+------+
|2  |Piyu|2010        |20               |40000 |
|3  |Jack|2010        |10               |40000 |
+---+----+------------+-----------------+------+
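To see where the two functions diverge, here is a minimal sketch. The dupDF and onceDF dataframes below are hypothetical, made up only for this illustration.

# Hypothetical dataframes: dupDF holds the "Piyu" row twice
dupDF = spark.createDataFrame(
    [(2, "Piyu", "2010", "20", 40000),
     (2, "Piyu", "2010", "20", 40000),
     (1, "Mac", "2018", "10", 30000)],
    schema=record_Columns)
onceDF = spark.createDataFrame(
    [(2, "Piyu", "2010", "20", 40000)],
    schema=record_Columns)

# subtract() compares distinct rows: every "Piyu" copy is removed
dupDF.subtract(onceDF).show(truncate=False)    # returns only the "Mac" row

# exceptAll() respects multiplicity: one "Piyu" copy survives
dupDF.exceptAll(onceDF).show(truncate=False)   # returns "Mac" and one "Piyu"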

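Solution 3: Using a left_anti join –

As mentioned at the start, a left_anti join is another way to subtract one dataframe from another. The code below is a minimal sketch, assuming both dataframes share the same column names as above: joining on every column keeps only the rows of sampleDF1 that have no exact match in sampleDF2. Note that join keys containing nulls never match, so rows with null values behave differently here than with subtract().

# Anti join on all columns: keep rows of sampleDF1 absent from sampleDF2
final = sampleDF1.join(sampleDF2, on=record_Columns, how="left_anti")
final.show(truncate=False)

This should also return the same two non-common rows as the solutions above.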

Subtraction is a very common operation in big data work. If we do not use the right technique, this dataset subtraction can become computation-heavy. All of the techniques above are implemented efficiently by pyspark and take very little time on dataframes of this size.

Thanks

Data Science Learner Team