I'm trying to use the DateTimeFormatter from java.time.format in Spark but it appears to be not serializable. This is the relevant chunk of code:
val pattern = "<some pattern>".r
val dtFormatter = DateTimeFormatter.ofPattern("<some non-ISO pattern>")
val logs = sc.wholeTextFiles(path)
val entries = logs.flatMap(fileContent => {
val file = fileContent._1
val content = fileContent._2
content.split("\\r?\\n").map(line => line match {
case pattern(dt, ev, seq) => Some(LogEntry(LocalDateTime.parse(dt, dtFormatter), ev, seq.toInt))
case _ => logger.error(s"Cannot parse $file: $line"); None
})
})
How can I avoid the java.io.NotSerializableException: java.time.format.DateTimeFormatter
exception? Is there a better library to parse timestamps? I've read that Joda is also not serializable and has been incorporated in Java 8's time library.
A formatter created from a pattern can be used as many times as necessary, it is immutable and is thread-safe. The count of pattern letters determines the format.
The main API for dates, times, instants, and durations. Generic API for calendar systems other than the default ISO. Provides classes to print and parse dates and times.
LocalTime. Represents a time (hour, minute, second and nanoseconds (HH-mm-ss-ns)) LocalDateTime. Represents both a date and a time (yyyy-MM-dd-HH-mm-ss-ns) DateTimeFormatter.
You can avoid serialization in two ways:
Assuming its value can be constant, place the formatter in an object
(making it "static"). This would mean that the static value can be accessed within each worker, instead of the driver serializing it and sending to worker:
object MyUtils {
val dtFormatter = DateTimeFormatter.ofPattern("<some non-ISO pattern>")
}
import MyUtils._
logs.flatMap(fileContent => {
// can safely use formatter here
})
instantiate it per record inside the anonymous function. This carries some performance penalty (as the instantiation will happen over and over, per record), so only use this option if the first can't be applied:
logs.flatMap(fileContent => {
val dtFormatter = DateTimeFormatter.ofPattern("<some non-ISO pattern>")
// use formatter here
})
Another approach is to make the DateTimeFormatter transient. This tells the JVM/Spark that the variable is not to be serialized, and instead constructed from scratch. For something that is cheap to construct per executor, like a DateTimeFormatter, this is a good approach.
Here's an article that describes this in more detail.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With