How do you find the Spark dataframe shape in PySpark? (With Code)


We can get the shape of a PySpark dataframe, but rows and columns have to be counted separately: count() returns the number of rows, and len(df.columns) returns the number of columns (columns is an attribute, not a method). Most of us come from a pandas background, where we never have to treat rows and columns separately and simply read the shape attribute. In this article, we will practically show you how to get the PySpark dataframe shape.

Spark dataframe shape in PySpark (step by step) –

Firstly, let’s create a dummy PySpark dataframe as a prerequisite.

Step 1: Dummy PySpark dataframe creation –

Run the code below to create the dummy PySpark dataframe.

from pyspark.sql import SparkSession

# Start a local Spark session.
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("DataScienceLearner.com") \
    .getOrCreate()

# Dummy records of (name, marks, grade).
data = [('Abhishek', 50, 'A'), ('Ankita', 45, 'B'), ('Sukesh', 54, 'A'), ('Saket', 64, 'A')]
pyspark_dataframe = spark.createDataFrame(data, ["name", "marks", "grade"])
pyspark_dataframe.printSchema()
pyspark_dataframe.show()

Here we created the PySpark dataframe with 4 rows and 3 columns, but how do we get this information using code? The next section is all you need.

Step 2: PySpark dataframe row and column count –

It is very simple, as explained at the beginning: use the count() function for the rows and the columns attribute for the columns.

# (row_count, column_count) -- the PySpark equivalent of pandas' df.shape
print((pyspark_dataframe.count(), len(pyspark_dataframe.columns)))

Output –

(4, 3)
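
If you need the shape in more than one place, you can wrap the two calls in a small helper that mimics pandas. Below is a minimal sketch; the name spark_shape is our own, not a built-in PySpark API.

def spark_shape(df):
    """Return (row_count, column_count) for a PySpark dataframe."""
    return (df.count(), len(df.columns))

print(spark_shape(pyspark_dataframe))  # prints (4, 3)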

Note –

Sometimes developers convert the PySpark dataframe to pandas and then read the shape attribute. The problem with this approach is memory: toPandas() collects the whole dataframe onto the driver, so not every PySpark dataframe can be converted to pandas. That scalability gap is exactly why people use PySpark over pandas in the first place.
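
If you only need a quick pandas-side preview, a safer pattern is to bound the number of rows before converting. Here is a minimal sketch; the 1000-row limit is an arbitrary choice for illustration.

# Convert only a bounded subset to pandas so driver memory stays safe.
# The 1000-row limit is an illustrative assumption, not a recommendation.
preview_df = pyspark_dataframe.limit(1000).toPandas()
print(preview_df.shape)  # pandas' shape attribute works on the preview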

Additional Information on PySpark –

1. How do you display a PySpark DataFrame in a table format?

If you noticed the code above, we displayed the PySpark dataframe using the show() function. Just to focus on the same, it prints the dataframe in a tabular format like this –

+--------+-----+-----+
|    name|marks|grade|
+--------+-----+-----+
|Abhishek|   50|    A|
|  Ankita|   45|    B|
|  Sukesh|   54|    A|
|   Saket|   64|    A|
+--------+-----+-----+

Note that show() displays only the first 20 rows by default and truncates long values to 20 characters.
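
If those defaults are too tight, show() accepts parameters to change them. The values below are only examples.

pyspark_dataframe.show(n=50, truncate=False)  # up to 50 rows, no value truncation
pyspark_dataframe.show(vertical=True)         # print one column-value pair per line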

 

2. How to check the memory size of a PySpark dataframe?

Sometimes row and column counts are not enough, and you need the memory size of the PySpark dataframe. Here is the simplest technique:

2.1. Firstly, take a fraction of the dataframe and convert it into a pandas dataframe (if a full conversion is not possible).

2.2. Then use the info() function on the pandas dataframe to read the memory usage.

# Optionally sample a fraction first if the full dataframe is too large:
# pyspark_dataframe = pyspark_dataframe.sample(fraction=0.01)
pandas_df = pyspark_dataframe.toPandas()
pandas_df.info()
The last line of the info() output reports the memory usage of the dataframe.

This gives us the size of the pandas copy in bytes. If you converted only a fraction, scale the number up accordingly to estimate the full PySpark dataframe's size.
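
For dataframes that are too large to convert fully, one rough approach is to measure a sample and scale it up. This is only a sketch that assumes the sample is representative; the 1% fraction is arbitrary.

# Estimate the full dataframe's in-memory size from a 1% sample.
fraction = 0.01
sample_df = pyspark_dataframe.sample(fraction=fraction).toPandas()
sample_bytes = sample_df.memory_usage(deep=True).sum()  # deep=True counts string contents
print(f"Estimated total size: ~{sample_bytes / fraction:,.0f} bytes")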

 

Thanks

Data Science Learner Team
