Spark SQL has support for automatically inferring the schema from a JSON input source (each row is a standalone JSON document). It does so by scanning the entire data set to create the schema, but it's still useful. (I'm talking about 1.2.1, not the new 1.3, so there might be some changes.)
I've seen some conflicting posts about it being supported / not supported, but I think it was recently added (in 1.2).
My question is: what is the right way to format a Date/Datetime/Timestamp in JSON so that Spark SQL identifies it as such in its automatic schema inference mechanism?
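For context, here is a minimal sketch of the behaviour in question, assuming Spark 1.2.x with an existing sqlContext and a hypothetical line-delimited events.json; the file and field names are only illustrative:

// events.json contains lines such as:
//   {"id": 1, "createdAt": "2015-03-17T09:30:00.000Z"}
val events = sqlContext.jsonFile("events.json")
events.printSchema()
// root
//  |-- createdAt: string (nullable = true)   <-- inferred as string, not timestamp
//  |-- id: long (nullable = true)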
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the read.json() function, which loads data from a directory of JSON files where each line of the files is a JSON object. Note that a file offered as a JSON file here is not a typical JSON file: each line must contain a separate, self-contained, valid JSON object.
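As a quick sketch, assuming Spark 1.4+'s DataFrameReader and a hypothetical line-delimited people.json:

// people.json: one self-contained JSON object per line, e.g.
//   {"name": "Andy", "age": 30}
//   {"name": "Bob"}
val people = sqlContext.read.json("people.json")
people.printSchema()   // schema is inferred by scanning the data
people.show()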
JSON does not have a built-in type for date/time values. The general consensus is to store the date/time value as a string in ISO 8601 format. Example: { "myDateTime": "2018-12-10T13:45:00.000Z" }
There is no date format in JSON; there are only strings that a de-/serializer decides to map to date values. However, ISO 8601 (the format produced by JavaScript's built-in JSON serialization of Date objects) contains all the information needed to be understood by both humans and computers, and it does not rely on the start of the computer era (1970-01-01).
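If it helps, a small sketch of producing such an ISO 8601 string on the JVM side (assumes java.time, i.e. Java 8+; the field name is just an example):

import java.time.Instant

// Instant.toString renders in ISO 8601 / UTC, e.g. "2018-12-10T13:45:00.123Z"
val now = Instant.now()
val json = s"""{ "myDateTime": "$now" }"""
println(json)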
Spark to_date() – Convert String to Date format. The to_date() function converts a string (StringType) column to a date (DateType) column. The snippet below takes a date held in a string and converts it to a date column on a DataFrame.
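A minimal sketch of that conversion, assuming a SparkSession named spark (e.g. in spark-shell) and Spark 2.2+ for the to_date(column, format) overload; the column names are illustrative:

import org.apache.spark.sql.functions.{col, to_date}
import spark.implicits._

val df = Seq("2019-01-23", "2019-06-24").toDF("input_date")

// parse the string column into a proper DateType column
val withDate = df.withColumn("date", to_date(col("input_date"), "yyyy-MM-dd"))

withDate.printSchema()   // input_date: string, date: date
withDate.show()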
It is possible to have the inference recognize dates in a format of your choosing (I used the Date.toJSON format) with a little modification to Spark's source, while keeping reasonable performance.
Get the latest maintenance branch:
git clone https://github.com/apache/spark.git
cd spark
git checkout branch-1.4
Replace the following block in InferSchema:
case VALUE_STRING if parser.getTextLength < 1 =>
  // Zero length strings and nulls have special handling to deal
  // with JSON generators that do not distinguish between the two.
  // To accurately infer types for empty strings that are really
  // meant to represent nulls we assume that the two are isomorphic
  // but will defer treating null fields as strings until all the
  // record fields' types have been combined.
  NullType

case VALUE_STRING => StringType
with the following code:
case VALUE_STRING =>
  val len = parser.getTextLength
  if (len < 1) {
    NullType
  } else if (len == 24) {
    // try to match dates of the form "1968-01-01T12:34:56.789Z"
    // for performance, only try parsing if text is 24 chars long and ends with a Z
    val chars = parser.getTextCharacters
    val offset = parser.getTextOffset
    if (chars(offset + len - 1) == 'Z') {
      try {
        org.apache.spark.sql.catalyst.util.
          DateUtils.stringToTime(new String(chars, offset, len))
        TimestampType
      } catch {
        case e: Exception => StringType
      }
    } else {
      StringType
    }
  } else {
    StringType
  }
Build Spark according to your setup. I used:
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests=true clean install
To test, create a file named datedPeople.json at the top level which contains the following data:
{"name":"Andy", "birthdate": "2012-04-23T18:25:43.511Z"}
{"name":"Bob"}
{"name":"This has 24 characters!!", "birthdate": "1988-11-24T11:21:13.121Z"}
{"name":"Dolla Dolla BillZZZZZZZZ", "birthdate": "1968-01-01T12:34:56.789Z"}
Read in the file. Make sure that you set the conf option before using the sqlContext at all, or it won't work. Dates!
.\bin\spark-shell.cmd
scala> sqlContext.setConf("spark.sql.json.useJacksonStreamingAPI", "true")
scala> val datedPeople = sqlContext.read.json("datedPeople.json")
datedPeople: org.apache.spark.sql.DataFrame = [birthdate: timestamp, name: string]
scala> datedPeople.foreach(println)
[2012-04-23 13:25:43.511,Andy]
[1968-01-01 06:34:56.789,Dolla Dolla BillZZZZZZZZ]
[null,Bob]
[1988-11-24 05:21:13.121,This has 24 characters!!]
The JSON type inference will never infer date types. Non-zero-length strings are always inferred to be strings. Source code:
private[sql] object InferSchema {
  // ...
  private def inferField(parser: JsonParser): DataType = {
    import com.fasterxml.jackson.core.JsonToken._
    parser.getCurrentToken match {
      // ...
      case VALUE_STRING => StringType
      // ...
    }
  }
  // ...
}
For automatic detection this would have to be changed to look at the actual string (parser.getValueAsString) and, based on the format, return DateType when appropriate.
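A hedged sketch of what such a change might look like, as a fragment meant to slot into the match shown above (this is illustrative only, not the actual Spark code, and the yyyy-MM-dd check is just one possible format test):

// inside parser.getCurrentToken match { ... }
case VALUE_STRING =>
  val s = parser.getValueAsString
  // treat strings shaped like "2015-03-17" as dates; everything else stays a string
  if (s != null && s.matches("""\d{4}-\d{2}-\d{2}""")) DateType else StringType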
It's probably simpler to just take the normal auto-generated schema and convert the date types as a second step.
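For instance, a minimal sketch of that two-step approach, reusing the datedPeople.json sample above and assuming Spark 1.4+ as in the earlier answer:

// load with the auto-generated schema (birthdate is inferred as string)
val people = sqlContext.read.json("datedPeople.json")

// second step: cast the ISO 8601 string column to a timestamp
val withTs = people.withColumn("birthdate", people("birthdate").cast("timestamp"))

withTs.printSchema()   // birthdate: timestamp, name: string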
Another option would be to read a small sample of the data (without using Spark) and infer the schema yourself. Then use your schema to create the DataFrame. This avoids some computation as well.
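A sketch of that approach, reusing the field names from datedPeople.json and assuming Spark 1.4+'s DataFrameReader.schema:

import org.apache.spark.sql.types._

// hand-built schema: no inference pass over the data is needed
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("birthdate", TimestampType, nullable = true)
))

val datedPeople = sqlContext.read.schema(schema).json("datedPeople.json")
datedPeople.printSchema()   // birthdate parsed as timestamp, name as string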