Pyspark read parquet: Get Syntax with Implementation


Pyspark read parquet is actually a function (spark.read.parquet("path")) for reading files stored in the Parquet format from storage such as HDFS. In this article we will demonstrate the use of this function with a bare-minimum example.

The syntax for Pyspark read parquet –

Here is the syntax for this function.

spark_dataframe = spark.read.parquet("pathToParquetFile")

But please do not forget the prerequisite for calling this function: you need a SparkSession first.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquetFile").getOrCreate()
read_parquet_df = spark.read.parquet("path")

Here we first create the SparkSession object, which is what exposes the read.parquet function in Pyspark.
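As a side note, the same reader accepts more than one path in a single call, which is handy when your data is split across several parquet outputs. A minimal sketch, with purely hypothetical file names:

# Read two separate parquet outputs into one dataframe
# (hypothetical paths; they are assumed to share a compatible schema)
combined_df = spark.read.parquet("sales_2021.parquet", "sales_2022.parquet")
combined_df.printSchema()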

Complete Implementation –

In this section, I will create a sample dataframe in front of you and write it into a parquet file. After that, I will call the same function to read it back. We are using a very small parquet file so that the functionality is easy to follow, but in practice parquet files can be very large. The format is as common in data engineering and advanced data processing as CSV, JSON etc. are in the general programming world.
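Since we just compared Parquet with CSV and JSON, here is a small hedged sketch showing that the DataFrameWriter API looks the same for all three formats; df stands for any Spark dataframe you already have, and the output paths are hypothetical:

# The same dataframe can be persisted in any of these formats; only parquet
# keeps the full schema and stores the data column-wise.
df.write.csv("employees_csv", header=True)
df.write.json("employees_json")
df.write.parquet("employees_parquet")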

 

Step 1: Writing a Sample Parquet File –

Here is the complete code. Let's go through it –

import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession used to build and write the dataframe
spark = SparkSession.builder.appName("parquetFile").getOrCreate()

# Sample records: (seq, Name, joining_year, specialization_id, salary)
records = [(1, "Mac", "2018", "10", 30000),
           (2, "Piyu", "2010", "20", 40000),
           (3, "Jack", "2010", "10", 40000),
           (4, "Charlee", "2005", "60", 35000),
           (5, "Guo", "2010", "40", 38000)]
record_Columns = ["seq", "Name", "joining_year", "specialization_id", "salary"]

# Build the dataframe, preview it, and write it out in parquet format
sampleDF = spark.createDataFrame(data=records, schema=record_Columns)
sampleDF.show(truncate=False)
sampleDF.write.parquet("sample.parquet")

Once you run this code, it will create "sample.parquet" at the same directory level where you are running it. Note that Spark writes it as a directory containing one or more part files rather than as a single flat file.
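If you want to confirm what was actually written, a quick sanity check in plain Python is to list the output directory; the exact part file names will differ from run to run:

import os

# "sample.parquet" is a directory; Spark places one or more part-*.parquet
# files plus a _SUCCESS marker inside it.
for name in sorted(os.listdir("sample.parquet")):
    print(name)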

Step 2: Reading the Parquet file –

In this step, we will simply read the parquet file which we have just created –

spark = SparkSession.builder.appName("parquetFile").getOrCreate()
read_parquet_df = spark.read.parquet("sample.parquet")
read_parquet_df.head(1)

 

[Output screenshot: Pyspark read parquet]

Here the head() function is only there to validate that the above code is working as expected.
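If you prefer a tabular view instead of a list of Row objects, show() and printSchema() work equally well for this sanity check:

# Both calls are for inspection only; neither modifies the dataframe
read_parquet_df.printSchema()              # column names and types read from parquet
read_parquet_df.show(5, truncate=False)    # first five rows in tabular form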

Well, I hope the use of the read parquet function is clear now. Parquet is a very common approach for storing large amounts of data in a columnar structure, which can really improve query time for complex logic because only the required columns have to be read (see the short sketch after this paragraph). Anyway, if you have any other questions related to this topic, please comment below or write back to us. Also, please do not forget to subscribe to us for more similar articles on Pyspark and Data Science.
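Here is that sketch, based on the dataframe above: when you select only the columns you need, Spark can skip the other parquet columns entirely (column pruning).

# Only the "Name" and "salary" columns are scanned from the parquet files,
# because parquet stores each column separately.
salaries_df = spark.read.parquet("sample.parquet").select("Name", "salary")
salaries_df.show()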

Thanks 

Data Science Learner Team


Meet Abhishek (Chief Editor), a data scientist with major expertise in NLP and text analytics. He has worked on various projects involving text data and has been able to achieve great results. He currently manages Datasciencelearner.com, where he and his team share knowledge and help others learn more about data science.
 