We can subtract one PySpark dataset from another using the exceptAll() and subtract() functions. A left_anti join is another alternative. In this article, we will create dummy PySpark dataframes and explore all three techniques.
Use the code below to create the dummy dataframes; they will help in exploring the solutions that follow. The first dataframe contains only five sample rows.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('data science learner.com').getOrCreate()

records = [(1, "Mac", "2018", "10", 30000),
           (2, "Piyu", "2010", "20", 40000),
           (3, "Jack", "2010", "10", 40000),
           (4, "Charlee", "2005", "60", 35000),
           (5, "Guo", "2010", "40", 38000)]
record_Columns = ["seq", "Name", "joining_year", "specialization_id", "salary"]
sampleDF1 = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF1.show(truncate=False)
Let’s see the dataframe in tabular format.
The second dummy dataframe keeps only three rows, each a duplicate of a row from the dataframe above.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

records = [(1, "Mac", "2018", "10", 30000),
           (4, "Charlee", "2005", "60", 35000),
           (5, "Guo", "2010", "40", 38000)]
record_Columns = ["seq", "Name", "joining_year", "specialization_id", "salary"]
sampleDF2 = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF2.show(truncate=False)
Here is the dataframe in tabular format.
The first solution uses the subtract() function. Here is the complete syntax:
final=sampleDF1.subtract(sampleDF2)
final.show(truncate=False)
Since sampleDF1 and sampleDF2 have three rows in common, running this code returns only the two remaining rows. Let's run it and see the output.
Similar to this, we can use the exceptAll() function to subtract two PySpark datasets. See the code below for reference.
final=sampleDF1.exceptAll(sampleDF2)
final.show(truncate=False)
Running this code should produce output identical to the above: the two rows that are not common to both dataframes. The one difference between the functions is that exceptAll() preserves duplicate rows, while subtract() de-duplicates its result; since our sample data contains no duplicate rows, both return the same two rows here.
Subtraction is a very common operation in big data work. If we do not use the right technique to perform this dataset subtraction, it becomes computation-heavy. All of the techniques above are performance-optimized in PySpark and take very little time.
Thanks
Data Science Learner Team