Pyspark read parquet: Get Syntax with Implementation


Pyspark read parquet is actually a function (spark.read.parquet("path")) for reading files stored in the Parquet format from storage such as HDFS. In this article we will demonstrate the use of this function with a bare-minimum example.

The syntax for Pyspark read parquet –

Here is the syntax for this function.

spark_dataframe = spark.read.parquet("pathToParquetFile")

But please do not forget the prerequisite for calling this function: you need a SparkSession first.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquetFile").getOrCreate()
read_parquet_df = spark.read.parquet("path")

Here we first create the SparkSession object, which is what exposes the read.parquet function in Pyspark.
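As a side note, the same reader accepts more than one path in a single call, which is handy when your data is split across several parquet outputs. A minimal sketch, with purely hypothetical file names:

# Read two separate parquet outputs into one dataframe
# (hypothetical paths; they are assumed to share a compatible schema)
combined_df = spark.read.parquet("sales_2021.parquet", "sales_2022.parquet")
combined_df.printSchema()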

Complete Implementation –

In this section, I will create a sample dataframe in front of you and write it into a parquet file. After that, I will call the same function to read it back. We are using a very small parquet file so that the functionality is easy to follow, but in practice parquet files can be very large. The format is as common in data engineering and advanced data processing as CSV, JSON etc. are in the general programming world.
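Since we just compared Parquet with CSV and JSON, here is a small hedged sketch showing that the DataFrameWriter API looks the same for all three formats; df stands for any Spark dataframe you already have, and the output paths are hypothetical:

# The same dataframe can be persisted in any of these formats; only parquet
# keeps the full schema and stores the data column-wise.
df.write.csv("employees_csv", header=True)
df.write.json("employees_json")
df.write.parquet("employees_parquet")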

 

Step 1: Writing a Sample Parquet File –

Here is the complete code. Let's go through it –

import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession used to build and write the dataframe
spark = SparkSession.builder.appName("parquetFile").getOrCreate()

# Sample records: (seq, Name, joining_year, specialization_id, salary)
records = [(1, "Mac", "2018", "10", 30000),
           (2, "Piyu", "2010", "20", 40000),
           (3, "Jack", "2010", "10", 40000),
           (4, "Charlee", "2005", "60", 35000),
           (5, "Guo", "2010", "40", 38000)]
record_Columns = ["seq", "Name", "joining_year", "specialization_id", "salary"]

# Build the dataframe, preview it, and write it out in parquet format
sampleDF = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF.show(truncate=False)
sampleDF.write.parquet("sample.parquet")

Once you run this code, it will create "sample.parquet" at the same directory level where you are running it. Note that Spark writes it as a directory containing one or more part files rather than as a single flat file.
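If you want to confirm what was actually written, a quick sanity check in plain Python is to list the output directory; the exact part file names will differ from run to run:

import os

# "sample.parquet" is a directory; Spark places one or more part-*.parquet
# files plus a _SUCCESS marker inside it.
for name in sorted(os.listdir("sample.parquet")):
    print(name)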

Step 2: Reading the Parquet file –

In this step, we will simply read the parquet file which we have just created –

spark = SparkSession.builder.appName("parquetFile").getOrCreate()
read_parquet_df = spark.read.parquet("sample.parquet")
read_parquet_df.head(1)

 

[Output screenshot: Pyspark read parquet]

Here the head() function is only there to validate that the above code is working as expected.
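If you prefer a tabular view instead of a list of Row objects, show() and printSchema() work equally well for this sanity check:

# Both calls are for inspection only; neither modifies the dataframe
read_parquet_df.printSchema()              # column names and types read from parquet
read_parquet_df.show(5, truncate=False)    # first five rows in tabular form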

Well, I hope the use of the read parquet function is clear now. Parquet is a very common approach for storing large amounts of data in a columnar structure, which can really improve query time for complex logic because only the required columns have to be read (see the short sketch after this paragraph). Anyway, if you have any other questions related to this topic, please comment below or write back to us. Also, please do not forget to subscribe to us for more similar articles on Pyspark and Data Science.
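Here is that sketch, based on the dataframe above: when you select only the columns you need, Spark can skip the other parquet columns entirely (column pruning).

# Only the "Name" and "salary" columns are scanned from the parquet files,
# because parquet stores each column separately.
salaries_df = spark.read.parquet("sample.parquet").select("Name", "salary")
salaries_df.show()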

Thanks 

Data Science Learner Team


Meet Abhishek (Chief Editor), a data scientist with major expertise in NLP and text analytics. He has worked on various projects involving text data and has been able to achieve great results. He currently manages Datasciencelearner.com, where he and his team share knowledge and help others learn more about data science.
 