Pyspark union Concept and Approach : With Code

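The PySpark union() function appends the rows of one DataFrame to another DataFrame that has the same schema (the same structure). To combine more than two DataFrames, chain the calls. Columns are matched by position, and Spark raises an error if the two DataFrames do not have the same number of columns. In this article, we first create two simple PySpark DataFrames and then perform a union on them.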

Pyspark union Implementation –

The first step is the prerequisite. Let's create two sample PySpark DataFrames with the same schema (column names and types).

 

Step 1 ( Prerequisite) :-

Here is the code to generate the first sample PySpark DataFrame.

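from pyspark.sql import SparkSession

# Create the SparkSession, or reuse the one a shell or notebook already provides.
# The application name is arbitrary.
spark = SparkSession.builder.appName("pyspark-union").getOrCreate()

# Each tuple is one row; the values follow the order of record_Columns.
records = [(1,"Mac","2018","10",30000), 
    (2,"Piyu","2010","20",40000), 
    (3,"Jack","2010","10",40000)
    ]
record_Columns = ["seq","Name","joining_year", "specialization_id","salary"]
sampleDF = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF.show(truncate=False)

[Image: union dataframe prerequisite 1]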

We will use the same schema to create the second DataFrame, using the code below.

from pyspark.sql import SparkSession

# Reuses the existing session if the first snippet has already run.
spark = SparkSession.builder.appName("pyspark-union").getOrCreate()

# Two more rows with the same column layout as the first DataFrame.
records = [ 
    (4,"Charlee","2005","60",35000), 
    (5,"Guo","2010","40",38000)]
record_Columns = ["seq","Name","joining_year", "specialization_id","salary"]
sampleDF_2 = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF_2.show(truncate=False)

[Image: union dataframe prerequisite 2]

 

So far we have created two PySpark DataFrames (sampleDF and sampleDF_2) with the same schema. In the next step we will union these two DataFrames.

Step 2:- Union Pyspark dataframe-

Let's see the syntax with an example. Although it is self-explanatory, we will run it and look at the output for better understanding.

unionDataFrame = sampleDF.union(sampleDF_2)
unionDataFrame.show(truncate=False)

Here unionDataFrame should contain all the rows of sampleDF as well as all the rows of sampleDF_2. Let's see the output:

[Image: pyspark union implementation]

 

Here you can see that we have all five rows from the two DataFrames above.

Union without duplicates –

In many scenarios, the DataFrames being combined contain rows that appear in both. union() keeps these duplicate rows (it behaves like UNION ALL in SQL), so if we do not want them we need to change the syntax a bit.

unionDataFrame = sampleDF.union(sampleDF_2).distinct()
unionDataFrame.show(truncate=False)

Adding distinct() keeps only the unique rows of the union.

We hope this article was informative. If you have any doubts or questions, feel free to ask and our team will help you out. Also, do not forget to subscribe for more articles on PySpark.

Thanks
Data Science Learner Team
