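The PySpark union() function appends the rows of one dataframe to another dataframe with the same schema (the same structure); to combine more than two dataframes, chain the calls. It matches columns by position rather than by name, and it raises an error when the two dataframes have a different number of columns. In this article, we will first create two simple PySpark dataframes and then perform the union on them.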
PySpark union implementation –
The first step covers the prerequisites. Let's create two sample PySpark dataframes with the same schema (column names).
Step 1 (Prerequisite) :-
Here is the code to generate the first sample PySpark dataframe.
from pyspark.sql import SparkSession

# Create the Spark session, or reuse the one already running (e.g. in the pyspark shell).
spark = SparkSession.builder.appName("pyspark-union").getOrCreate()

records = [
    (1, "Mac", "2018", "10", 30000),
    (2, "Piyu", "2010", "20", 40000),
    (3, "Jack", "2010", "10", 40000),
]
record_Columns = ["seq", "Name", "joining_year", "specialization_id", "salary"]

sampleDF = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF.show(truncate=False)
We will use the same schema to create the second dataframe. Use the code below.
from pyspark.sql import SparkSession

# getOrCreate() returns the active Spark session, or starts one if none exists.
spark = SparkSession.builder.getOrCreate()

records = [
    (4, "Charlee", "2005", "60", 35000),
    (5, "Guo", "2010", "40", 38000),
]
record_Columns = ["seq", "Name", "joining_year", "specialization_id", "salary"]

sampleDF_2 = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF_2.show(truncate=False)
So far we have created two PySpark dataframes (sampleDF and sampleDF_2) with the same schema. In the next step we will union these two dataframes.
Step 2:- Union PySpark dataframes-
Let's look at the syntax with our sample dataframes. Although it is self-explanatory, we will run it and check the output for a better understanding.
unionDataFrame = sampleDF.union(sampleDF_2)
unionDataFrame.show(truncate=False)
Here unionDataFrame should contain every row of sampleDF as well as every row of sampleDF_2. Running the code shows all five rows from the two dataframes above.
Union without duplicates –
In many scenarios the dataframes contain duplicate rows. union() keeps them (it behaves like UNION ALL in SQL), so when we do not want duplicates in the result we need to change the syntax a bit and chain distinct().
unionDataFrame = sampleDF.union(sampleDF_2).distinct()
unionDataFrame.show(truncate=False)
Calling distinct() after union() keeps only the unique rows of the combined dataframe, which also removes any duplicates that already existed inside either input.
We hope this article was informative for you. Please feel free to reach out if you have any doubts or concerns; our team will help you out. Also, do not forget to subscribe for more articles on PySpark.
Data Science Learner Team