Pyspark withColumn : Syntax with Example

Pyspark withColumn featured image

Pyspark withColumn() function is useful in creating, transforming existing pyspark dataframe columns or changing the data type of column. In this article, we will see all the most common usages of withColumn() function.

 

2. Pyspark withColumn() –

In order to demonstrate the complete functionality, we will create a dummy Pyspark dataframe and secondly, we will explore the functionalities and concepts.

import pyspark
from pyspark.sql import SparkSession
records = [ 
    (4,"Charlee","2005","60",35000), 
    (5,"Guo","2010","40",38000)]
record_Columns = ["seq","Name","joining_year", "specialization_id","salary"]
sampleDF = spark.createDataFrame(data=records, schema = record_Columns)
sampleDF.show(truncate=False)

2.1 Changing datatype for column –

As we have specialization_id as Integer format, suppose you want to transform the same into float type. You may use the withColumn() function for the same.

sampleDF.withColumn("specialization_id ",col("specialization_id ").cast("Float")).show()

Note – Do not forget to import this before running the above lines.

from pyspark.sql.functions import col

Output –

withColumn Datatype casting
withColumn Datatype casting

2.2 Transformation of existing column using withColumn() –

Suppose you want to divide or multiply the existing column with some other value, Please use withColumn function. Here is the code for this-

sampleDF.withColumn("specialization_id_modified",col("specialization_id")* 2).show()
withColumn multiply with constant
withColumn multiply with constant

 

2.3  Creating new column in Pyspark dataframe using constant value –

Firstly, Before we operate it further, you need to import the lit module for the same. Let’s directly run the code and taste the water. Actually, it is self-explanatory stuff.

from pyspark.sql.functions import lit
sampleDF.withColumn("User_define_Constant_Col", lit("ABC")).show()
constant column using withColumn()
constant column using withColumn()

2.4 Renaming column using Pyspark –

Actually it is not exactly withColumn() but withColumnRename() , Lets see the example-

Rename Pyspark dataframe
Rename Pyspark dataframe

 

Above all, I hope you must have liked this article on withColumn(). This is one of the useful functions in Pyspark which every developer/data engineer.  Please write back to us if you have any concerns related to withColumn() function, You may also comment below in the comment box.

In addition,  please subscribe us for more articles on the same topic on Pyspark , data engineering, and data science.

 

Thanks 

Data Science Learner Team

Join our list

Subscribe to our mailing list and get interesting stuff and updates to your email inbox.

Thank you for signup. A Confirmation Email has been sent to your Email Address.

Something went wrong.

Meet Abhishek ( Chief Editor) , a data scientist with major expertise in NLP and Text Analytics. He has worked on various projects involving text data and have been able to achieve great results. He is currently manages Datasciencelearner.com, where he and his team share knowledge and help others learn more about data science.
 
Thank you For sharing.We appreciate your support. Don't Forget to LIKE and FOLLOW our SITE to keep UPDATED with Data Science Learner