The "pyspark column is not iterable" error occurs when we try to call a PySpark column as if it were a function, since Column objects are not callable. Actually, this is not a PySpark-specific error. The generic error is TypeError: 'Column' object is not callable, and the same error is also possible with pandas and other libraries. Because it usually surfaces while working with a PySpark dataframe, we describe it in the above way. In this article, we are going to uncover this error with one practical example. We will also understand the best way to fix it.
pyspark column is not iterable : ( Root Cause and Fix ) –
Let's create a dummy PySpark dataframe and then build a scenario where we can replicate this error. Here is the code to create the dummy dataframe.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Data Science Learner').getOrCreate()
data_df = [[1, "Abhishek", "A"], [2, "Ankita", "B"], [3, "Sukesh", "C"]]
columns = ['Seq', 'Name', 'Identifier']
dataframe = spark.createDataFrame(data_df, columns)
dataframe.show()
Let's run it and check whether the dummy PySpark dataframe was created.
Yes, it was. Now let's apply a condition over a column in a way that replicates the error.
dataframe.select('Identifier').where(dataframe.Identifier() < 'B').show()
Here we get the error because Identifier is a PySpark Column, but the parentheses after it treat it as a function. That is the root cause of this error. In Python, only callable objects such as functions and methods can be invoked with parentheses; instances of types like NoneType, list, tuple, int and str are not callable, and neither is Column.
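We can reproduce the same kind of TypeError in plain Python, without Spark at all. Below is a minimal sketch using a hypothetical stand-in class (not the real pyspark.sql.Column): calling an instance of a class that does not define __call__ fails exactly the way dataframe.Identifier() does.

```python
# Hypothetical stand-in for pyspark.sql.Column, used only for illustration.
class Column:
    pass

col = Column()

try:
    col()  # treating a non-callable object as a function
except TypeError as err:
    print(err)  # 'Column' object is not callable
```

The error message names the type of the object being called, which is why PySpark reports 'Column' object is not callable.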
Fixing this bug by syntax correction –
As explained above, this is just a syntax error. We can fix it by removing the parentheses after the column name of the PySpark dataframe. To make it clearer: in the example above we used dataframe.Identifier(), which is incorrect because it treats the column as a function (a callable object). If we remove the parentheses and access the column the correct way, we get rid of the error.
dataframe.select('Identifier').where(dataframe.Identifier < 'B').show()
A very similar error occurs when we try to call the complete dataframe as a function: dataframe() raises TypeError: 'DataFrame' object is not callable. Hopefully the basics are now clear.
Frequently Asked Questions
1. What is Pyspark?
PySpark is the Python API for Apache Spark, an open-source big data processing engine designed to be fast and easy to use. With PySpark you can create dataframes and use Spark's powerful data processing capabilities from Python.
2. How do I install Pyspark?
It's very easy to install PySpark. Just open your terminal or command prompt and use the pip command. But before that, you have to check the version of Python.
To check the Python version, use the command below.
python --version
If the version is 3.xx then use pip3, and if it is 2.xx then use the pip command.
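You can also read the version from inside Python itself. The snippet below is a small sketch that prints the interpreter version and picks the matching install command (the pip_command variable is just an illustration).

```python
import sys

# Print the running interpreter's version, e.g. "3.11.4".
print(sys.version.split()[0])

# sys.version_info exposes the version components as integers.
if sys.version_info.major >= 3:
    pip_command = "pip3 install pyspark"
else:
    pip_command = "pip install pyspark"
print(pip_command)
```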
Run the below command to install Pyspark in your system.
For Python 3.xx
pip3 install pyspark
For Python 2.xx
pip install pyspark
3. How do you get columns in PySpark?
There are different ways to get columns in PySpark. The most common is the select() method. For example, if you want to get the column "A", use the line of code below.
df = df.select("A")
To get more than one column, pass a list of column names.
df = df.select(["A","B"])
You can also apply conditions on a column, like below.
df = df.filter(df["B"] > 50)
It will return a new DataFrame containing only the rows where the value in column "B" is greater than 50.
Data Science Learner Team