Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark and Not Serializable DateTimeFormatter

I'm trying to use the DateTimeFormatter from java.time.format in Spark but it appears to be not serializable. This is the relevant chunk of code:

val pattern = "<some pattern>".r
val dtFormatter = DateTimeFormatter.ofPattern("<some non-ISO pattern>")

val logs = sc.wholeTextFiles(path)

val entries = logs.flatMap(fileContent => {
    val file = fileContent._1
    val content = fileContent._2
    content.split("\\r?\\n").map(line => line match {
      case pattern(dt, ev, seq) => Some(LogEntry(LocalDateTime.parse(dt, dtFormatter), ev, seq.toInt))
      case _ => logger.error(s"Cannot parse $file: $line"); None
    })
  })

How can I avoid the java.io.NotSerializableException: java.time.format.DateTimeFormatter exception? Is there a better library to parse timestamps? I've read that Joda is also not serializable and has been incorporated in Java 8's time library.

like image 920
Ian Avatar asked Mar 21 '16 13:03

Ian


People also ask

Is Java time DateTimeFormatter thread-safe?

A formatter created from a pattern can be used as many times as necessary, it is immutable and is thread-safe. The count of pattern letters determines the format.

What is the use of DateTimeFormatter?

The main API for dates, times, instants, and durations. Generic API for calendar systems other than the default ISO. Provides classes to print and parse dates and times.

What is Java DateTimeFormatter?

LocalTime. Represents a time (hour, minute, second and nanoseconds (HH-mm-ss-ns)) LocalDateTime. Represents both a date and a time (yyyy-MM-dd-HH-mm-ss-ns) DateTimeFormatter.


2 Answers

You can avoid serialization in two ways:

  1. Assuming its value can be constant, place the formatter in an object (making it "static"). This would mean that the static value can be accessed within each worker, instead of the driver serializing it and sending to worker:

    object MyUtils {
      val dtFormatter = DateTimeFormatter.ofPattern("<some non-ISO pattern>")
    }
    
    import MyUtils._
    logs.flatMap(fileContent => {
      // can safely use formatter here
    })
    
  2. instantiate it per record inside the anonymous function. This carries some performance penalty (as the instantiation will happen over and over, per record), so only use this option if the first can't be applied:

    logs.flatMap(fileContent => {
      val dtFormatter = DateTimeFormatter.ofPattern("<some non-ISO pattern>")
      // use formatter here
    })
    
like image 122
Tzach Zohar Avatar answered Oct 18 '22 21:10

Tzach Zohar


Another approach is to make the DateTimeFormatter transient. This tells the JVM/Spark that the variable is not to be serialized, and instead constructed from scratch. For something that is cheap to construct per executor, like a DateTimeFormatter, this is a good approach.

Here's an article that describes this in more detail.

like image 36
Eyal Avatar answered Oct 18 '22 19:10

Eyal