JSON Column OR Traditional Columns

My team is working on an invoicing module where our users need to be able to add line items to an invoice and store those line items in our database. These line items can be edited after the initial create.

An invoice's line item would look something like this.

LineItemName
EnteredBy
Quantity
CostPerUnit
Subtotal

One of our team members is suggesting that we store our row data in a JSON column rather than in multiple SQL columns. His argument is that it will be simpler to store all the line item data in a single JSON column than to write the code to detect which line items may have been deleted, updated, re-ordered, or otherwise manipulated from the database's original state.

I have not worked with JSON columns before, and from what I can tell there are a number of performance concerns when using them, as well as some additional complexity when building queries against the JSON column's data. We will have to report off of these line items, so performance is definitely a concern. We are also running SQL Server 2012, which as I understand it has no native support for JSON columns; that support only arrives if we upgrade to SQL Server 2016. Side note: we will most likely move to MySQL within the next 2-3 years.

Could anyone provide some guidance on what the right call is here? My instinct is that we should leverage existing methods and write the additional code to detect the database changes in order to avoid the headache of performance concerns and reporting complexity later.

asked Oct 30 '15 by Hunter Nelson

People also ask

What are JSON columns?

JSON columns, like columns of other binary types, are not indexed directly; instead, you can create an index on a generated column that extracts a scalar value from the JSON column. See Indexing a Generated Column to Provide a JSON Column Index, for a detailed example.
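
For illustration, here is a minimal sketch of the generated-column technique described above (MySQL 5.7+); the table and column names are hypothetical:

    -- extract a scalar from the JSON document into a generated column,
    -- then index that column (the JSON column itself cannot be indexed)
    CREATE TABLE invoice_items (
        id    INT AUTO_INCREMENT PRIMARY KEY,
        data  JSON NOT NULL,
        line_item_name VARCHAR(100)
            GENERATED ALWAYS AS (data->>'$.LineItemName') STORED,
        INDEX idx_line_item_name (line_item_name)
    );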

What are the drawbacks of JSON columns?

The drawback? If your JSON has multiple fields with the same key, only one of them, the last one, will be retained. The other drawback is that MySQL doesn't support indexing JSON columns directly, which means that searching through your JSON documents can result in a full table scan.

How can you list all columns for a given table?

In a query editor, if you highlight the text of a table name (e.g. dbo.MyTable) and hit ALT + F1, you'll get a list of column names, types, lengths, etc.
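
For what it's worth, ALT + F1 is the SSMS shortcut bound to sp_help by default. An alternative that works in any query tool (assuming a table named dbo.MyTable) is the INFORMATION_SCHEMA views:

    -- list every column of dbo.MyTable with its type and length
    SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'MyTable'
    ORDER BY ORDINAL_POSITION;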

How do I query JSON data in SQL?

To query JSON data, you can use standard T-SQL. If you must create a query or report on JSON data, you can easily convert JSON data to rows and columns by calling the OPENJSON rowset function. For more information, see Convert JSON Data to Rows and Columns with OPENJSON (SQL Server).
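
As a minimal sketch of OPENJSON (SQL Server 2016+; the JSON literal below is invented to match the question's line-item fields):

    DECLARE @json NVARCHAR(MAX) = N'[
        {"LineItemName":"Widget","Quantity":3,"CostPerUnit":9.99},
        {"LineItemName":"Gadget","Quantity":1,"CostPerUnit":24.50}
    ]';

    -- shred the JSON array into typed rows and columns
    SELECT LineItemName, Quantity, CostPerUnit,
           Quantity * CostPerUnit AS Subtotal
    FROM OPENJSON(@json)
    WITH (
        LineItemName NVARCHAR(100) '$.LineItemName',
        Quantity     INT           '$.Quantity',
        CostPerUnit  DECIMAL(10,2) '$.CostPerUnit'
    );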


2 Answers

I would try to give a slightly different answer :)

Use relational columns if you expect a lot of updates and calculations. Referencing and updating columns is better than updating and referencing JSON fields. In this case you are optimizing DML performance and probably some analytics.

Use JSON/XML if you have information that is rarely changed, and if you want to avoid too many JOINs, or if your application expects JSON. In this case you will optimize read and load performance.

The other answer lists many benefits of a relational schema, and I cannot argue that it is wrong. However, I would mention a few use cases where JSON can help:

  1. Imagine that you have large tables where you need to join 100K invoices with 1M invoice line items. In a relational model you get two table scans and a JOIN, while with JSON/XML you get a single table scan. If your application expects a response formatted as JSON (e.g. you are sending line items to an Angular, Knockout, or other JavaScript template via an Ajax call), JSON is a perfect fit; imagine how a query over a more complex structure would look compared to a single table scan (see the FOR JSON sketch after this list). De-normalization is one of the oldest tricks for improving query performance, and JSON is just one de-normalization technique alongside materialized views, aggregations in OLAP cubes, etc. It is not a solution for all your problems, but it helps in some scenarios.
  2. Imagine that you need to import parent/child tables. You need to insert one invoice row, take @@IDENTITY, use that identity to insert the related line items, and repeat this for each imported invoice. The alternative is to force the ids by setting IDENTITY_INSERT ON. With JSON/XML, if the line items arrive formatted as JSON as part of each invoice, you can use a simple bulk import, which is the fastest way to load data (see the OPENJSON sketch below).
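
Both points map to SQL Server 2016 features. The sketches below assume a simplified, hypothetical Invoice/InvoiceLineItem schema and are illustrative only:

    -- 1. Return each invoice with its line items as a nested JSON array,
    --    ready to hand to a JavaScript client
    SELECT i.InvoiceId, i.InvoiceDate,
           (SELECT li.LineItemName, li.Quantity, li.CostPerUnit
            FROM InvoiceLineItem li
            WHERE li.InvoiceId = i.InvoiceId
            FOR JSON PATH) AS LineItems
    FROM Invoice i
    FOR JSON PATH;

    -- 2. Bulk-load line items that arrive as a JSON column on a staging
    --    table: one set-based INSERT, no per-row @@IDENTITY bookkeeping
    INSERT INTO InvoiceLineItem (InvoiceId, LineItemName, Quantity, CostPerUnit)
    SELECT s.InvoiceId, j.LineItemName, j.Quantity, j.CostPerUnit
    FROM StagingInvoice s
    CROSS APPLY OPENJSON(s.LineItemsJson)
    WITH (
        LineItemName NVARCHAR(100) '$.LineItemName',
        Quantity     INT           '$.Quantity',
        CostPerUnit  DECIMAL(10,2) '$.CostPerUnit'
    ) AS j;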

These are some of the reasons why people are switching to NoSQL (e.g. MongoDB or Azure DocumentDB). SQL Server 2016 will support JSON natively; in previous versions you would need to use XML, but the principles are the same.
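
On SQL Server 2012, where native JSON is not available, the same denormalized read can be produced with FOR XML (same hypothetical schema as above):

    SELECT i.InvoiceId,
           (SELECT li.LineItemName, li.Quantity, li.CostPerUnit
            FROM InvoiceLineItem li
            WHERE li.InvoiceId = i.InvoiceId
            FOR XML PATH('LineItem'), ROOT('LineItems'), TYPE)
    FROM Invoice i
    FOR XML PATH('Invoice'), ROOT('Invoices');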

In your case, it seems that you will frequently update line items and you do not have a heavy read/load scenario, so I would propose a relational schema.

answered Oct 14 '22 by Jovan MSFT


Short answer: do not store it in JSON. Use columns; this is why they exist.

Long answer

You are using a relational database to store your data, so use the features that this software provides to store and organize it.

As already noted in the comments, storing values in their own separate columns allows you to perform aggregations and filtering on those columns without the overhead of parsing a non-relational data structure (most probably using a third-party plugin/CLR/function/whatever).

Additionally, JSON data has no fixed structure: you cannot verify the consistency of the data stored in a JSON field without parsing the field and writing custom validations.

Storing multiple values in one field also means that you can not (or can not easily)

  • use constraints on the embedded fields
  • enforce the structure, valid ranges, and valid values of each embedded field
  • define the data type of each embedded field
  • index the data (nor include it in an index)
  • aggregate or search on those fields
  • scale the system
  • query the metadata, the list of fields, etc. for the embedded ones

The database server can not

  • keep track of index statistics for each field
  • optimize queries that use the JSON field (because of the string manipulation required to extract the data)
  • store the data of each field optimally

The above points are important, but neither list is complete.
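
To make the lists concrete, here is what a relational line-item table buys you declaratively; the names are hypothetical, modeled on the fields in the question:

    CREATE TABLE InvoiceLineItem (
        InvoiceLineItemId INT IDENTITY PRIMARY KEY,
        InvoiceId     INT NOT NULL
            REFERENCES Invoice (InvoiceId),              -- referential integrity
        LineItemName  NVARCHAR(100) NOT NULL,            -- enforced type and length
        EnteredBy     NVARCHAR(50)  NOT NULL,
        Quantity      INT NOT NULL
            CHECK (Quantity > 0),                        -- valid range
        CostPerUnit   DECIMAL(10,2) NOT NULL
            CHECK (CostPerUnit >= 0),
        Subtotal AS (Quantity * CostPerUnit) PERSISTED   -- derived, always consistent
    );

    -- indexable and directly aggregatable, no parsing required
    CREATE INDEX IX_InvoiceLineItem_InvoiceId
        ON InvoiceLineItem (InvoiceId);

None of this is available to the database engine once the same data is hidden inside a JSON string.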

What will you win?

  • Some field names.
  • Some flexibility in the database, but much more complex code in the application, because all of the validation has to be done in the application, twice: when writing and when reading the data.
  • A huge headache when you have to fix something in the JSON field.

As an alternative, you can use XML columns, which SQL Server does support natively, and some of the issues above go away. BUT: XML still has no fixed structure (and if it had one, you could store the data in traditional columns; in both cases, you have to specify the structure manually).

As a note: the format you choose for storing the data is of course opinion-based, but as a rule of thumb, use traditional columns whenever they can satisfy your needs, and avoid serialized data, especially if you want to use parts of it in any kind of calculation.

When can you store serialized data? When it is not important to enforce consistency and you will NEVER use it for statistical queries or for filtering. (However, in most cases the "never" part is not true :))

answered Oct 14 '22 by Pred