I'm dealing with data input in the form of JSON documents. These documents need to have a certain format; if they're not compliant, they should be ignored. I'm currently using a messy list of if/then checks to verify the format of each document.
I have been experimenting a bit with different Python json-schema libraries, which work OK, but I'm still able to submit a document with keys not described in the schema, which makes them useless to me.
This example doesn't raise an exception, although I would expect it to:
#!/usr/bin/python
from jsonschema import Validator
checker = Validator()
schema = {
    "type": "object",
    "properties": {
        "source": {
            "type": "object",
            "properties": {
                "name": {"type": "string"}
            }
        }
    }
}
data = {
    "source": {
        "name": "blah",
        "bad_key": "This data is not allowed according to the schema."
    }
}
checker.validate(data, schema)
My question is twofold:
Thanks,
Jay
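By default, JSON Schema accepts keys that are not listed under "properties". Add "additionalProperties": False to each object in the schema that must reject undeclared keys: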
#!/usr/bin/python
from jsonschema import Validator
checker = Validator()
schema = {
    "type": "object",
    "properties": {
        "source": {
            "type": "object",
            "properties": {
                "name": {"type": "string"}
            },
            "additionalProperties": False,  # add this
        }
    }
}
data = {
    "source": {
        "name": "blah",
        "bad_key": "This data is not allowed according to the schema."
    }
}
checker.validate(data, schema)
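The snippet above uses the Validator class from an older jsonschema release. As a rough sketch of the same idea on a newer release, assuming jsonschema 3.0 or later is installed (where Draft7Validator is available), the schema can be compiled once and used to drop non-compliant documents. The helper name filter_compliant and the sample documents are made up for illustration and are not from the original post.

```python
from jsonschema import Draft7Validator

SCHEMA = {
    "type": "object",
    "properties": {
        "source": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            # Reject any key under "source" that is not declared above.
            "additionalProperties": False,
        }
    },
}

# Fail early if the schema itself is malformed, then build the validator once.
Draft7Validator.check_schema(SCHEMA)
validator = Draft7Validator(SCHEMA)


def filter_compliant(documents):
    """Hypothetical helper: keep only the documents that satisfy SCHEMA."""
    return [doc for doc in documents if validator.is_valid(doc)]


good = {"source": {"name": "blah"}}
bad = {"source": {"name": "blah", "bad_key": "not allowed"}}

accepted = filter_compliant([good, bad])

# Collect human-readable messages for the rejected document.
errors = [error.message for error in validator.iter_errors(bad)]
```

Note that "additionalProperties" only applies to the object it is written in. In this sketch the top-level object would still accept extra keys next to "source" unless the keyword is added there as well.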