to_timestamp PySpark Function: String to Timestamp Conversion

The to_timestamp() function is part of the “pyspark.sql.functions” module. It converts a string column into a timestamp column. In this article, we will walk through a complete implementation using a dummy dataframe with minimal rows and data. We will go step by step: first we create the dataframe, then we apply to_timestamp() to the column that needs converting.

to_timestamp pyspark function : ( Implementation ) –

As mentioned, the first step is to create a dummy PySpark dataframe as a prerequisite for this walkthrough.

Step 1: (Prerequisite)-

We will include the import statements along with the dummy data creation. Here is the code; let's run it:

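from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Reuse the active session (for example in a notebook) or start a new one
spark = SparkSession.builder.getOrCreate()

# Three sample rows; both columns start out as strings
df = spark.createDataFrame(
    data=[("100", "2021-12-01 10:01:19"),
          ("101", "2021-11-02 11:01:19"),
          ("102", "2021-10-24 12:08:19")],
    schema=["Seq", "string_timestamp"])
df.printSchema()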

As explained, to_timestamp() is a function in the “pyspark.sql.functions” module, so we need to import it first. In the code above we used a wildcard (*) import, which brings every public name from that module into the current namespace. Note that a wildcard import is not a best practice: among other things, it can shadow Python built-ins such as sum, min and max with the Spark functions of the same name. I recommend importing only the functions that the code actually calls.
The dummy dataframe has only three rows and two columns, “Seq” and “string_timestamp”. The string_timestamp column is the one we will convert to timestamp format. Let's move on to the second and final step.

Step 2 : Converting String column to Timestamp format in Pyspark –

In this step, we will add a new column to the PySpark dataframe above using the withColumn function.

df_modified = df.withColumn("converted_timestamp", to_timestamp("string_timestamp"))

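The code above creates a new column named “converted_timestamp” whose data type is timestamp, whereas the original column in the mocked dataframe is a string. Because no format argument is passed, to_timestamp() follows Spark's rules for casting a string to a timestamp, which handle values such as “2021-12-01 10:01:19”. Let's put all the code together and run it.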

Complete Code –

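from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Reuse the active session (for example in a notebook) or start a new one
spark = SparkSession.builder.getOrCreate()

# Three sample rows; both columns start out as strings
df = spark.createDataFrame(
    data=[("100", "2021-12-01 10:01:19"),
          ("101", "2021-11-02 11:01:19"),
          ("102", "2021-10-24 12:08:19")],
    schema=["Seq", "string_timestamp"])
df.printSchema()

# Add a timestamp column derived from the string column
df_modified = df.withColumn("converted_timestamp", to_timestamp("string_timestamp"))
df_modified.show(truncate=False)
df_modified.printSchema()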

In the output, the second printSchema() call lists the “converted_timestamp” column (the derived one) with the timestamp data type, while the “string_timestamp” column (the original one) is still a string. This is exactly what we wanted to achieve.

Thanks

Data Science Learner Team