
PySpark union: Concept and Approach, With Code

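The PySpark union function appends the rows of one DataFrame to another DataFrame that has the same schema (the same structure). It takes a single DataFrame as its argument, so to combine more than two you chain the calls. Note that union matches columns by position, not by name, and it raises an error if the two DataFrames have a different number of columns. In this article, we will first create two simple PySpark DataFrames and then perform a union on them.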

PySpark union Implementation –

The first step is the prerequisite. Let’s create two sample PySpark DataFrames with the same schema (column names and types).

Step 1 (Prerequisite):-

Here is the code to generate the first sample PySpark DataFrame.

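from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession that the rest of the code calls `spark`
spark = SparkSession.builder.getOrCreate()

# Sample rows: (seq, Name, joining_year, specialization_id, salary)
records = [
    (1, "Mac", "2018", "10", 30000),
    (2, "Piyu", "2010", "20", 40000),
    (3, "Jack", "2010", "10", 40000),
]
record_Columns = ["seq", "Name", "joining_year", "specialization_id", "salary"]
sampleDF = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF.show(truncate=False)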

[Image: union dataframe prerequisite 1]

We will use the same schema to create the second DataFrame. Use the code below.

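from pyspark.sql import SparkSession

# Reuse the active SparkSession (or create one if none exists)
spark = SparkSession.builder.getOrCreate()

# Second batch of rows, with the same columns in the same order
records = [
    (4, "Charlee", "2005", "60", 35000),
    (5, "Guo", "2010", "40", 38000),
]
record_Columns = ["seq", "Name", "joining_year", "specialization_id", "salary"]
sampleDF_2 = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF_2.show(truncate=False)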

[Image: union dataframe prerequisite 2]

So far we have created two PySpark DataFrames (sampleDF and sampleDF_2) with the same schema. In the next step we will union these two DataFrames.

Step 2:- Union the PySpark DataFrames –

Let’s look at the syntax using the sample DataFrames. Although it is self-explanatory, we will run it and check the output for better understanding.

unionDataFrame = sampleDF.union(sampleDF_2)
unionDataFrame.show(truncate=False)

Here unionDataFrame should contain all the rows of sampleDF as well as all the rows of sampleDF_2. Let’s see the output-

[Image: pyspark union implementation]

Here you can see all five rows from the two DataFrames above.

Union without duplicates –

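In many scenarios, the DataFrames we combine contain duplicate rows. union keeps them (it behaves like UNION ALL in SQL), so a row present in both DataFrames appears twice in the result. When we do not want that, we need to change the syntax a bit.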

unionDataFrame = sampleDF.union(sampleDF_2).distinct()
unionDataFrame.show(truncate=False)

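This will select only the unique rows in the union. Note that distinct() works on the whole result, so it also removes duplicates that already existed within a single DataFrame. We hope this article was informative for you. Please feel free to reach out if you have any doubts or concerns, and our team will help you. Also, do not forget to subscribe for more articles on PySpark.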

Thanks
Data Science Learner Team