Spark SQL has support for automatically inferring the schema from a JSON input source (each row is a standalone JSON document). It does so by scanning the entire data set to create the schema, but it's still useful. (I'm talking about 1.2.1, not the new 1.3, so there might be some changes.)
I've seen some conflicting posts about it being supported / not supported, but I think it was recently added (in 1.2).
My question is: what is the right way to format a Date/Datetime/Timestamp in JSON so that Spark SQL identifies it as such in its automatic schema inference mechanism?
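For context, here is a minimal sketch of the behaviour in question, assuming Spark 1.2.x with an existing sqlContext and a hypothetical line-delimited events.json; the file and field names are only illustrative:

// events.json contains lines such as:
//   {"id": 1, "createdAt": "2015-03-17T09:30:00.000Z"}
val events = sqlContext.jsonFile("events.json")
events.printSchema()
// root
//  |-- createdAt: string (nullable = true)   <-- inferred as string, not timestamp
//  |-- id: long (nullable = true)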
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the read.json() function, which loads data from a directory of JSON files where each line of the files is a JSON object. Note that a file offered as a JSON file here is not a typical JSON file: each line must contain a separate, self-contained, valid JSON object.
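As a quick sketch, assuming Spark 1.4+'s DataFrameReader and a hypothetical line-delimited people.json:

// people.json: one self-contained JSON object per line, e.g.
//   {"name": "Andy", "age": 30}
//   {"name": "Bob"}
val people = sqlContext.read.json("people.json")
people.printSchema()   // schema is inferred by scanning the data
people.show()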
JSON does not have a built-in type for date/time values. The general consensus is to store the date/time value as a string in ISO 8601 format. Example: { "myDateTime": "2018-12-10T13:45:00.000Z" }
There is no date format in JSON; there are only strings that a de-/serializer decides to map to date values. However, ISO 8601 (the format produced by JavaScript's built-in JSON serialization of Date objects) contains all the information needed to be understood by both humans and computers, and it does not rely on the start of the computer era (1970-01-01).
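If it helps, a small sketch of producing such an ISO 8601 string on the JVM side (assumes java.time, i.e. Java 8+; the field name is just an example):

import java.time.Instant

// Instant.toString renders in ISO 8601 / UTC, e.g. "2018-12-10T13:45:00.123Z"
val now = Instant.now()
val json = s"""{ "myDateTime": "$now" }"""
println(json)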
Spark to_date() – Convert String to Date format. The to_date() function converts a string (StringType) column to a date (DateType) column. The snippet below takes a date held in a string and converts it to a date column on a DataFrame.
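A minimal sketch of that conversion, assuming a SparkSession named spark (e.g. in spark-shell) and Spark 2.2+ for the to_date(column, format) overload; the column names are illustrative:

import org.apache.spark.sql.functions.{col, to_date}
import spark.implicits._

val df = Seq("2019-01-23", "2019-06-24").toDF("input_date")

// parse the string column into a proper DateType column
val withDate = df.withColumn("date", to_date(col("input_date"), "yyyy-MM-dd"))

withDate.printSchema()   // input_date: string, date: date
withDate.show()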
It is possible to have the inference recognize dates in a format of your choosing (I used the Date.toJSON format) with a little modification to Spark's source, while keeping reasonable performance.
Get the latest maintenance branch:
git clone https://github.com/apache/spark.git
cd spark
git checkout branch-1.4
Replace the following block in InferSchema:
case VALUE_STRING if parser.getTextLength < 1 =>
  // Zero length strings and nulls have special handling to deal
  // with JSON generators that do not distinguish between the two.
  // To accurately infer types for empty strings that are really
  // meant to represent nulls we assume that the two are isomorphic
  // but will defer treating null fields as strings until all the
  // record fields' types have been combined.
  NullType

case VALUE_STRING => StringType
with the following code:
case VALUE_STRING =>
  val len = parser.getTextLength
  if (len < 1) {
    NullType
  } else if (len == 24) {
    // try to match dates of the form "1968-01-01T12:34:56.789Z"
    // for performance, only try parsing if text is 24 chars long and ends with a Z
    val chars = parser.getTextCharacters
    val offset = parser.getTextOffset
    if (chars(offset + len - 1) == 'Z') {
      try {
        org.apache.spark.sql.catalyst.util.
          DateUtils.stringToTime(new String(chars, offset, len))
        TimestampType
      } catch {
        case e: Exception => StringType
      }
    } else {
      StringType
    }
  } else {
    StringType
  }
Build Spark according to your setup. I used:
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests=true clean install
To test, create a file named datedPeople.json at the top level which contains the following data:
{"name":"Andy", "birthdate": "2012-04-23T18:25:43.511Z"}
{"name":"Bob"}
{"name":"This has 24 characters!!", "birthdate": "1988-11-24T11:21:13.121Z"}
{"name":"Dolla Dolla BillZZZZZZZZ", "birthdate": "1968-01-01T12:34:56.789Z"}
Read in the file. Make sure that you set the conf option before using the sqlContext at all, or it won't work. Dates!
.\bin\spark-shell.cmd
scala> sqlContext.setConf("spark.sql.json.useJacksonStreamingAPI", "true")
scala> val datedPeople = sqlContext.read.json("datedPeople.json")
datedPeople: org.apache.spark.sql.DataFrame = [birthdate: timestamp, name: string]
scala> datedPeople.foreach(println)
[2012-04-23 13:25:43.511,Andy]
[1968-01-01 06:34:56.789,Dolla Dolla BillZZZZZZZZ]
[null,Bob]
[1988-11-24 05:21:13.121,This has 24 characters!!]
The JSON type inference will never infer date types. Non-zero-length strings are always inferred to be strings. Source code:
private[sql] object InferSchema {
  // ...
  private def inferField(parser: JsonParser): DataType = {
    import com.fasterxml.jackson.core.JsonToken._
    parser.getCurrentToken match {
      // ...
      case VALUE_STRING => StringType
      // ...
    }
  }
  // ...
}
For automatic detection this would have to be changed to look at the actual string (parser.getValueAsString) and, based on the format, return DateType when appropriate.
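A hedged sketch of what such a change might look like, as a fragment meant to slot into the match shown above (this is illustrative only, not the actual Spark code, and the yyyy-MM-dd check is just one possible format test):

// inside parser.getCurrentToken match { ... }
case VALUE_STRING =>
  val s = parser.getValueAsString
  // treat strings shaped like "2015-03-17" as dates; everything else stays a string
  if (s != null && s.matches("""\d{4}-\d{2}-\d{2}""")) DateType else StringType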
It's probably simpler to just take the normal auto-generated schema and convert the date types as a second step.
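For instance, a minimal sketch of that two-step approach, reusing the datedPeople.json sample above and assuming Spark 1.4+ as in the earlier answer:

// load with the auto-generated schema (birthdate is inferred as string)
val people = sqlContext.read.json("datedPeople.json")

// second step: cast the ISO 8601 string column to a timestamp
val withTs = people.withColumn("birthdate", people("birthdate").cast("timestamp"))

withTs.printSchema()   // birthdate: timestamp, name: string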
Another option would be to read a small sample of the data (without using Spark) and infer the schema yourself. Then use your schema to create the DataFrame. This avoids some computation as well.
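A sketch of that approach, reusing the field names from datedPeople.json and assuming Spark 1.4+'s DataFrameReader.schema:

import org.apache.spark.sql.types._

// hand-built schema: no inference pass over the data is needed
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("birthdate", TimestampType, nullable = true)
))

val datedPeople = sqlContext.read.schema(schema).json("datedPeople.json")
datedPeople.printSchema()   // birthdate parsed as timestamp, name as string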