
PySpark write parquet : Implementation in steps

Writing a large PySpark dataframe to Parquet with the spark_dataframe.write.parquet("path") function is a very common requirement. In this article we will see the implementation with a self-explanatory example.

PySpark write parquet : Syntax –

Let's see the one-liner syntax for this function.

spark_dataframe.write.parquet("PathToSample.parquet") 

Here spark_dataframe is the final dataframe that you want to write as a Parquet file, and "PathToSample.parquet" is the path where the Parquet output should be generated.
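Beyond the bare path, the DataFrameWriter also lets you set a save mode and partition columns, which are often useful in practice. Below is a minimal sketch with the same placeholder names; "year_column" is just a stand-in for whatever column you want to partition on.

# Overwrite the target path if it already exists (the default mode fails when the path exists)
spark_dataframe.write.mode("overwrite").parquet("PathToSample.parquet")

# Partition the output by a column so later filters on that column scan fewer files
spark_dataframe.write.partitionBy("year_column").parquet("PathToSample.parquet")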

 

Writing a Parquet file from a Spark dataframe –

Let's do this in steps. In the first step we will import the necessary packages and create the SparkSession object. In the second step we will create a sample PySpark dataframe. In the third step we will write this sample dataframe into a Parquet file, which is the final outcome of this article.

Step 1 : Importing packages –

These imports are all we need for writing the sample dataframe to a Parquet file. We also create the SparkSession object here, since the later steps refer to it as spark.

import pyspark
from pyspark.sql import SparkSession
# create the SparkSession that the later steps refer to as spark
spark = SparkSession.builder.appName("write_parquet_example").getOrCreate()

Step 2 : Sample dataframe –

We will create some sample records and column names and then convert them into a PySpark dataframe using the code below.

records = [(1,"Mac","2018","10",30000), 
    (2,"Piyu","2010","20",40000), 
    (3,"Jack","2010","10",40000), 
    (4,"Charlee","2005","60",35000), 
    (5,"Guo","2010","40",38000)]
record_Columns = ["seq","Name","joining_year", "specialization_id","salary"]
sampleDF = spark.createDataFrame(data=records, schema = record_Columns)
sampleDF.show(truncate=False)
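Before writing, you can confirm the column types Spark inferred from the Python tuples with printSchema(); with the data above the output should look roughly like the comments below (integers become long, quoted values stay string).

sampleDF.printSchema()
# root
#  |-- seq: long (nullable = true)
#  |-- Name: string (nullable = true)
#  |-- joining_year: string (nullable = true)
#  |-- specialization_id: string (nullable = true)
#  |-- salary: long (nullable = true)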

Step 3 : Writing the dataframe into Parquet format –

Here we only need to write one line of code –

sampleDF.write.parquet("sample_example.parquet")
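If you want to sanity-check the result, you can read the written directory back and display it; parquetDF below is just a throwaway variable name for this verification.

# Read the freshly written Parquet output back into a dataframe and display it
parquetDF = spark.read.parquet("sample_example.parquet")
parquetDF.show(truncate=False)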

Full code together –

Here we have merged the code from the above three steps.

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("write_parquet_example").getOrCreate()
records = [(1,"Mac","2018","10",30000), 
    (2,"Piyu","2010","20",40000), 
    (3,"Jack","2010","10",40000), 
    (4,"Charlee","2005","60",35000), 
    (5,"Guo","2010","40",38000)]
record_Columns = ["seq","Name","joining_year", "specialization_id","salary"]
sampleDF = spark.createDataFrame(data=records, schema = record_Columns)
sampleDF.show(truncate=False)
sampleDF.write.parquet("sample_example.parquet")

This simple example is enough to understand the syntax and procedure for writing a Spark dataframe into the Parquet file format. Parquet is popular because its columnar layout speeds up filtering and column selection on tabular data, and it is usually much smaller and faster to scan than an equivalent CSV file.
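As a rough illustration of that point, with the sample_example.parquet output written above, Parquet lets Spark read only the columns a query actually needs:

# Only the Name and salary columns are read from disk; the other columns are skipped
spark.read.parquet("sample_example.parquet") \
    .select("Name", "salary") \
    .filter("salary > 35000") \
    .show()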

We hope you liked this article. Please subscribe to us for more articles on PySpark and Data Science. Reach out to us if you need an article on a specific topic in the same domain, or leave a comment below; our team will be happy to assist.

Thanks
Data Science Learner Team