Pyspark union Concept and Approach : With Code

Pyspark union Concept and Approach

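The PySpark union function appends the rows of one dataframe to another dataframe with the same schema (structurally the same). Columns are matched by position rather than by name, and an error is raised if the two dataframes do not have the same number of columns. To combine more than two dataframes, chain the union calls. In this article, we will first create two simple PySpark dataframes and then perform a union on them.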

Pyspark union Implementation –

The first step covers the prerequisites. Let's create two sample PySpark dataframes with the same schema (column names).

 

Step 1 ( Prerequisite) :-

Here is the code to generate the first sample pyspark dataframe.

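from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name here is arbitrary
spark = SparkSession.builder.appName("pyspark-union").getOrCreate()

# Each tuple is one row: (seq, Name, joining_year, specialization_id, salary)
records = [(1,"Mac","2018","10",30000),
    (2,"Piyu","2010","20",40000),
    (3,"Jack","2010","10",40000)
    ]
record_Columns = ["seq","Name","joining_year", "specialization_id","salary"]
sampleDF = spark.createDataFrame(data=records, schema = record_Columns)
sampleDF.show(truncate=False)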

We will use the same schema to create the second dataframe, using the code below.

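from pyspark.sql import SparkSession

# getOrCreate() returns the already running session if there is one
spark = SparkSession.builder.appName("pyspark-union").getOrCreate()

# Same column layout as the first dataframe
records = [
    (4,"Charlee","2005","60",35000),
    (5,"Guo","2010","40",38000)]
record_Columns = ["seq","Name","joining_year", "specialization_id","salary"]
sampleDF_2 = spark.createDataFrame(data=records, schema = record_Columns)
sampleDF_2.show(truncate=False)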

So far we have created two PySpark dataframes (sampleDF and sampleDF_2) with the same schema. In the next step we will union these two dataframes.

Step 2:- Union Pyspark dataframe-

Let's look at the syntax using the sample dataframes. It is largely self-explanatory, but we will also run it and check the output for better understanding.

unionDataFrame = sampleDF.union(sampleDF_2)
unionDataFrame.show(truncate=False)

Here unionDataFrame should contain all the rows of sampleDF as well as all the rows of sampleDF_2. Running the code prints the combined table.


The output contains all five rows from the two dataframes above.

Union without duplicates –

In many scenarios, the dataframes being combined contain some of the same rows. union keeps these duplicates (it behaves like UNION ALL in SQL), so if we do not want them we need to change the syntax a bit.

unionDataFrame = sampleDF.union(sampleDF_2).distinct()
unionDataFrame.show(truncate=False)

Calling distinct() on the union keeps only the unique rows. We hope this article was informative. If you have any doubts or questions, feel free to reach out and our team will help you. Also, do not forget to subscribe for more articles on PySpark.

Thanks
Data Science Learner Team


Meet Abhishek (Chief Editor), a data scientist with major expertise in NLP and Text Analytics. He has worked on various projects involving text data and has been able to achieve great results. He currently manages Datasciencelearner.com, where he and his team share knowledge and help others learn more about data science.
 