 

What is the right Date/Datetime format in JSON for Spark SQL to automatically infer the schema for it?

Spark SQL has support for automatically inferring the schema from a JSON input source (each line is a standalone JSON object). It does so by scanning the entire data set to create the schema, but it's still useful. (I'm talking about 1.2.1, not the new 1.3, so there might be some changes.)

I've seen some conflicting posts about it being supported / not supported, but I think it was recently added (in 1.2)

My question is: what is the right way to format a Date/Datetime/Timestamp in JSON for Spark SQL to identify it as such in its auto schema inference mechanism?

asked Mar 27 '15 by Eran Medan


People also ask

How does Spark infer JSON schema?

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the read.json() function, which loads data from a directory of JSON files where each line of the files is a JSON object. Note that a file offered as a JSON file here is not a typical multi-line JSON file.
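
For example, a minimal sketch using the Spark 1.4+ DataFrameReader API (the file name people.json is hypothetical):

val people = sqlContext.read.json("people.json")
people.printSchema()  // shows the schema Spark inferred from the JSON records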

What is date time format in JSON?

JSON does not have a built-in type for date/time values. The general consensus is to store the date/time value as a string in ISO 8601 format. Example { "myDateTime": "2018-12-10T13:45:00.000Z" }

How are dates written in JSON?

There is no date format in JSON; there are only strings that a de-/serializer decides to map to date values. However, ISO 8601 strings (as produced by JavaScript's built-in JSON serialization of Date objects) contain all the information needed to be understood by both humans and computers, and do not rely on the start of the Unix epoch (1970-01-01).

How do I change the date format in Spark SQL?

Spark's to_date() function converts a string (StringType) column to a date (DateType) column; the sketch below takes a date held in a string column and converts it to a date column on a DataFrame.
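
A minimal sketch of that conversion. Note that to_date() was added as a DataFrame function in later Spark releases than the one discussed in this question, and the column name and input path here are assumptions:

import org.apache.spark.sql.functions.to_date

val people = sqlContext.read.json("datedPeople.json")
// convert the ISO 8601 string column into a DateType column
val withDate = people.withColumn("birthdate", to_date(people("birthdate")))
withDate.printSchema()  // birthdate: date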


2 Answers

It is possible to infer dates using a format of your choosing (I used the Date.toJSON format) with a small modification, and still get reasonable performance.

Get the latest maintenance branch:

git clone https://github.com/apache/spark.git
cd spark
git checkout branch-1.4

Replace the following block in InferSchema:

  case VALUE_STRING if parser.getTextLength < 1 =>
    // Zero length strings and nulls have special handling to deal
    // with JSON generators that do not distinguish between the two.
    // To accurately infer types for empty strings that are really
    // meant to represent nulls we assume that the two are isomorphic
    // but will defer treating null fields as strings until all the
    // record fields' types have been combined.
    NullType

  case VALUE_STRING => StringType

with the following code:

  case VALUE_STRING =>
    val len = parser.getTextLength
    if (len < 1) {
      NullType
    } else if (len == 24) {
      // try to match dates of the form "1968-01-01T12:34:56.789Z"
      // for performance, only try parsing if text is 24 chars long and ends with a Z
      val chars = parser.getTextCharacters
      val offset = parser.getTextOffset
      if (chars(offset + len - 1) == 'Z') {
        try {
          org.apache.spark.sql.catalyst.util.
            DateUtils.stringToTime(new String(chars, offset, len))
          TimestampType
        } catch {
          case e: Exception => StringType
        }
      } else {
        StringType
      }
    } else {
      StringType
    }

Build Spark according to your setup. I used:

mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests=true clean install

To test, create a file named datedPeople.json at the top level which contains the following data:

{"name":"Andy", "birthdate": "2012-04-23T18:25:43.511Z"}
{"name":"Bob"}
{"name":"This has 24 characters!!", "birthdate": "1988-11-24T11:21:13.121Z"}
{"name":"Dolla Dolla BillZZZZZZZZ", "birthdate": "1968-01-01T12:34:56.789Z"}

Read in the file. Make sure that you set the conf option before using sqlContext at all, or it won't work. Dates!

.\bin\spark-shell.cmd
scala> sqlContext.setConf("spark.sql.json.useJacksonStreamingAPI", "true")
scala> val datedPeople = sqlContext.read.json("datedPeople.json")

datedPeople: org.apache.spark.sql.DataFrame = [birthdate: timestamp, name: string]

scala> datedPeople.foreach(println)

[2012-04-23 13:25:43.511,Andy]
[1968-01-01 06:34:56.789,Dolla Dolla BillZZZZZZZZ]
[null,Bob]
[1988-11-24 05:21:13.121,This has 24 characters!!]
answered Jan 03 '23 by heenenee


The JSON type inference will never infer date types. Non-zero-length strings are always inferred to be strings. Source code:

private[sql] object InferSchema {
  // ...
  private def inferField(parser: JsonParser): DataType = {
    import com.fasterxml.jackson.core.JsonToken._
    parser.getCurrentToken match {
      // ...
      case VALUE_STRING => StringType
      // ...
    }
  }
  // ...
}

For automatic detection this would have to be changed to look at the actual string (parser.getValueAsString) and return DateType when the string's format warrants it.

It's probably simpler to just take the normal auto-generated schema and convert the date types as a second step.
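
For example, a minimal sketch of that two-step approach. The column name birthdate and the file name are taken from the sample data in the other answer; whether the cast parses the trailing Z depends on the Spark version, so you may need an explicit parse instead:

import org.apache.spark.sql.types.TimestampType

val raw = sqlContext.read.json("datedPeople.json")  // birthdate is inferred as string
val withDates = raw.withColumn("birthdate", raw("birthdate").cast(TimestampType))
withDates.printSchema()  // birthdate: timestamp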

Another option would be to read a small sample of the data (without using Spark) and infer the schema yourself. Then use your schema to create the DataFrame. This avoids some computation as well.
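
A minimal sketch of supplying your own schema (field names and path are assumptions based on the sample data above; DataFrameReader.schema is available from Spark 1.4):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("birthdate", TimestampType, nullable = true)
))

// Spark skips inference entirely and parses birthdate as a timestamp
val datedPeople = sqlContext.read.schema(schema).json("datedPeople.json")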

answered Jan 03 '23 by Daniel Darabos