I imported a PostgreSQL table into Spark as a DataFrame using Scala. The DataFrame looks like:
user_id | log_dt
--------|---------------------------
96      | 2004-10-19 10:23:54.0
1020    | 2017-01-12 12:12:14.931652
I am transforming this DataFrame so that log_dt has the format yyyy-MM-dd HH:mm:ss.SSSSSS. To achieve this, I used the following code to convert log_dt to a timestamp with the unix_timestamp function:
val tablereader1 = tablereader1Df.withColumn("log_dt", unix_timestamp(tablereader1Df("log_dt"), "yyyy-MM-dd HH:mm:ss.SSSSSS").cast("timestamp"))
When I try to print the tablereader1 DataFrame using the command tablereader1.show(), I get the following result:
user_id | log_dt
--------|-----------------------
96      | 2004-10-19 10:23:54.0
1020    | 2017-01-12 12:12:14.0
How can I retain the microseconds as part of the timestamp? Any suggestions are appreciated.
date_format()
You can use Spark SQL's date_format(), which accepts Java SimpleDateFormat patterns. However, SimpleDateFormat can parse fractional seconds only down to milliseconds, using the pattern letter "S".
import org.apache.spark.sql.functions._
import spark.implicits._ // to use $-notation on columns

// "S" is SimpleDateFormat's millisecond field, so microseconds are dropped
val df = tablereader1Df.withColumn("log_dt", date_format($"log_dt", "S"))
Since SimpleDateFormat stops at milliseconds, you can instead parse the string with java.time in a UDF:

// Imports
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoField

/* Commented out as per the comment about IntelliJ
spark.udf.register("date_microsec", (dt: String) => {
  val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.n")
  LocalDateTime.parse(dt, dtFormatter).getLong(ChronoField.MICRO_OF_SECOND)
})
*/
import org.apache.spark.sql.functions.udf

// UDF that parses the timestamp string and returns its sub-second field
val date_microsec = udf((dt: String) => {
  // "n" reads the fractional digits as a nano-of-second value
  val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.n")
  LocalDateTime.parse(dt, dtFormatter).getLong(ChronoField.MICRO_OF_SECOND)
})
Check: help in building a DateTimeFormatter pattern.
Use ChronoField.NANO_OF_SECOND instead of ChronoField.MICRO_OF_SECOND to fetch the nanosecond value in the UDF.
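For example, a nanosecond-returning variant of the same UDF might look like this (a sketch; date_nanosec is an invented name):

// Same parsing as date_microsec, but fetching ChronoField.NANO_OF_SECOND
val date_nanosec = udf((dt: String) => {
  val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.n")
  LocalDateTime.parse(dt, dtFormatter).getLong(ChronoField.NANO_OF_SECOND)
})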
val df = tablereader1Df.withColumn("log_date_microsec", date_microsec($"log_dt"))