The PySpark to_timestamp function is part of the "pyspark.sql.functions" module. The to_timestamp() function converts a string column into a timestamp column. In this article, we will walk through a complete implementation using a dummy dataframe with minimal rows and data. We will go step by step: first create the dataframe, then apply the to_timestamp() function to the required column.
to_timestamp pyspark function : ( Implementation ) –
As mentioned, the first step is to create a dummy PySpark dataframe as a prerequisite for this walkthrough.
Step 1: (Prerequisite)-
We will include the import statement along with the dummy data creation. Here is the code; let's run it:
from pyspark.sql.functions import *

df = spark.createDataFrame(
    data=[
        ("100", "2021-12-01 10:01:19"),
        ("101", "2021-11-02 11:01:19"),
        ("102", "2021-10-24 12:08:19"),
    ],
    schema=["Seq", "string_timestamp"],
)
df.printSchema()
- As explained, to_timestamp() is a function in the "pyspark.sql.functions" module, so we need to import it first. In the lines above we used a star import (*), which pulls every public name of that module into our namespace. A star import is not a best practice, though: among other things, names such as sum, max, and min from this module shadow Python's built-ins. I recommend importing only the functions that the code actually calls.
- We are mocking only three rows with two columns, named ["Seq", "string_timestamp"]. The string_timestamp column is the one we will convert into timestamp format. Let's move on to the second and final step.
Step 2 : Converting String column to Timestamp format in Pyspark –
In this step, we will create a new column in the above PySpark dataframe using the withColumn() function:
df_modified = df.withColumn("converted_timestamp", to_timestamp("string_timestamp"))
This line generates a new column named "converted_timestamp" whose data type is timestamp, whereas the original column was created with the string data type when we mocked the dataframe. Let's put all the code together and run it.
Complete Code –
from pyspark.sql.functions import *

df = spark.createDataFrame(
    data=[
        ("100", "2021-12-01 10:01:19"),
        ("101", "2021-11-02 11:01:19"),
        ("102", "2021-10-24 12:08:19"),
    ],
    schema=["Seq", "string_timestamp"],
)
df.printSchema()

df_modified = df.withColumn("converted_timestamp", to_timestamp("string_timestamp"))
df_modified.show(truncate=False)
df_modified.printSchema()
In the output, the second printSchema() call reports the "converted_timestamp" column (the derived one) as timestamp, while the "string_timestamp" column (the original one) is still string. This is exactly what we wanted to achieve.
Data Science Learner Team