Pyspark write parquet : Implementation in steps

Pyspark write parquet is a very common piece of functionality for writing a large PySpark dataframe to disk using the dataframe.write.parquet("path") function. In this article we will walk through the implementation with a self-explanatory example.

Pyspark write parquet : (Syntax) –

Let's see the one-liner syntax for this function.

spark_dataframe.write.parquet("PathToSample.parquet") 

Here spark_dataframe is the final dataframe that you want to write as a parquet file, and PathToSample.parquet is the path where the parquet output will be generated.
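The same write call also accepts a save mode and optional partition columns. The snippet below is a minimal sketch of those options; the partition column name department is only a placeholder, so adjust the mode and column to your own data.

# "overwrite" replaces any existing output; other modes are "append", "ignore" and "error".
# partitionBy writes one sub-directory per distinct value of the given column.
spark_dataframe.write \
    .mode("overwrite") \
    .partitionBy("department") \
    .parquet("PathToSample.parquet")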

 

Writing a parquet file from a spark dataframe –

Let's do this in steps. In the first step we will import the necessary packages and create the SparkSession object. In the second step we will create a sample pyspark dataframe. In the third step we will write this sample dataframe to a parquet file, which is the final outcome of this article.

Step 1 : Importing packages –

These two imports are all we need for writing the sample dataframe to a parquet file; we also create the SparkSession object that the later steps use as spark.

import pyspark
from pyspark.sql import SparkSession
# Create the SparkSession object that is used as "spark" in the next steps.
spark = SparkSession.builder.appName("write_parquet_example").getOrCreate()

Step 2 : Sample dataframe –

We will create some sample records and column names and then convert them into a pyspark dataframe using the code below.

records = [(1,"Mac","2018","10",30000), 
    (2,"Piyu","2010","20",40000), 
    (3,"Jack","2010","10",40000), 
    (4,"Charlee","2005","60",35000), 
    (5,"Guo","2010","40",38000)]
record_Columns = ["seq","Name","joining_year", "specialization_id","salary"]
sampleDF = spark.createDataFrame(data=records, schema = record_Columns)
sampleDF.show(truncate=False)

Step 3 : Writing the dataframe in parquet format –

Here we only need to write one line, and here we go –

sampleDF.write.parquet("sample_example.parquet")
(Output: the parquet files are written under the sample_example.parquet directory.)
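To confirm the write worked, we can read the parquet output back into a new dataframe. This quick check is only a sketch (the name checkDF is just for illustration), assuming the write in Step 3 completed.

# Read the parquet output back and display it to verify the write.
checkDF = spark.read.parquet("sample_example.parquet")
checkDF.show(truncate=False)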

Full code together –

Here we have merged the code from the above three steps.

import pyspark
from pyspark.sql import SparkSession

# Create the SparkSession object.
spark = SparkSession.builder.appName("write_parquet_example").getOrCreate()

# Sample records and column names.
records = [(1,"Mac","2018","10",30000), 
    (2,"Piyu","2010","20",40000), 
    (3,"Jack","2010","10",40000), 
    (4,"Charlee","2005","60",35000), 
    (5,"Guo","2010","40",38000)]
record_Columns = ["seq","Name","joining_year", "specialization_id","salary"]

# Build the dataframe, display it, and write it in parquet format.
sampleDF = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF.show(truncate=False)
sampleDF.write.parquet("sample_example.parquet")

This simple example is enough to understand the syntax and procedure for writing a spark dataframe in parquet file format. The parquet format is popular because, as a columnar format, it speeds up filtering and other column-based operations on tabular data. In particular, it is usually much faster and smaller than a large CSV file.
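As a rough illustration of that filtering benefit, Spark can read just the columns it needs from a parquet file and push filters down to skip data. The sketch below reuses the sample file written above; the dataframe name filteredDF is only for illustration.

# Only the two selected columns are scanned from the parquet files,
# and the salary filter can be pushed down to skip row groups.
filteredDF = spark.read.parquet("sample_example.parquet") \
    .select("Name", "salary") \
    .filter("salary >= 38000")
filteredDF.show(truncate=False)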

I hope you like this article. Please subscribe to us for more articles on Pyspark and Data Science. Please reach out to us if you need an article on a specific topic in the same domain; our team will be happy to assist, and you may also leave a comment here.

Thanks
Data Science Learner Team

