
Normalizing/validation for international data sets in a database?

Let's say you are dealing with your normal contact database (you know... name, phone number, address, email, etc.). If you're only concerned about this locally, it's generally not a big issue to deal with, but when we look at international data sets, it is.

Looking at the phone number system, you would think it's simple, but it's really not. In North America, we generally have the 1-222-333-4444 format for calling people. This of course divides into your international dialing code, area code, exchange prefix, and line number. Problem: real phone numbers are limited. There are around 220 area codes in use in the US out of the potential 1000; each area code only has a limited number of exchanges; and the line numbers are restricted to specific uses in that country (for example, patterns with 911 are restricted, and only about three-quarters of the 10,000 are in use). Take this over to the UK: they have their own set of rules for line numbers, such as reserving most of the 0300-0399 block for specific uses, among other restrictions. International codes are also limited. Normalizing area codes and exchanges, and putting data validation checks onto phone numbers, just got complicated. I'm not going into detail about places that are not part of the NANP scheme, but let's just recognize that we can't really trust the North American template, kick back, and call it a day.
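To illustrate what I mean, even a naive NANP-only check is more involved than the dashed format suggests. This sketch is a simplification; the pattern and reserved-code checks only gesture at the restrictions above and say nothing about which codes are actually assigned:

import re

# Naive NANP-only check: 1-NXX-NXX-XXXX, where N is 2-9. This already
# bakes in assumptions (leading country code 1, hyphen separators) and
# knows nothing about which area codes or exchanges are really
# assigned -- exactly the gap described above.
NANP = re.compile(r"^1-([2-9]\d{2})-([2-9]\d{2})-(\d{4})$")

def validate_nanp(number: str) -> bool:
    m = NANP.match(number)
    if not m:
        return False
    area, exchange, _line = m.groups()
    # N11 codes (211, 311, ..., 911) are reserved for services.
    if area[1:] == "11" or exchange[1:] == "11":
        return False
    return True

print(validate_nanp("1-222-333-4444"))  # True
print(validate_nanp("1-911-333-4444"))  # False: reserved pattern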

How do we normalize for things like this? How do we validate data? How do we deal with these seemingly ad-hoc extension codes or instructions for internal dialing?

International addresses are not much better: not only does the data retained differ, the output formats aren't the same across the board either. How do we deal with international postal codes, when in Canada the format is A1A 1A1 and the USA has a system such as 55555[-4444]?
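Even the two-country case already forces per-country rules. A quick sketch of what I mean (the patterns are simplified; for instance, Canadian codes exclude several letters that I'm ignoring here):

import re

# One pattern per country -- already two incompatible shapes.
POSTAL_RULES = {
    "CA": re.compile(r"^[A-Z]\d[A-Z]\s?\d[A-Z]\d$"),  # simplified A1A 1A1
    "US": re.compile(r"^\d{5}(-\d{4})?$"),            # 55555[-4444]
}

def validate_postal(country: str, code: str) -> bool:
    rule = POSTAL_RULES.get(country)
    return bool(rule and rule.match(code.upper()))

print(validate_postal("CA", "A1A 1A1"))     # True
print(validate_postal("US", "55555-4444"))  # True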

I'm tempted to just write classes for each of these situations as I encounter them and store the data in the database as XML/JSON/similar, but then how do I relate fields and easily search my content? I don't want to end up creating a monster of thousands of tables, one for each country. I want an easily scalable solution where I can normalize my addresses and validate the content. Is this too much to ask?

asked Sep 27 '10 by Incognito


1 Answer

A way of approaching this problem may be:

adopt three views of addresses/phone numbers/post codes etc.

  1. The first view is that of addresses (say) as multiple lines of text.

  2. The second view is that of address tags (more on this below).

  3. The third view is that of validation rules for addresses.

The other components that are needed for this approach are a generic procedure (class/trigger) for validation; a formatting routine for printing purposes; and a rule base, together with an administrator's mechanism to update the validation rules. You also need a "catchall" rule that says "this is a valid address - it has been validated manually" - applied even when an address meets none of the rules of your rule base.

The components:

1. An address is made up of multiple lines, each of which has an associated sequence number and a number of tags (usually one). It is also possible to associate with an address line the set of rules, and versions of rules, that it was validated against, but this is a refinement that depends on your update/insert/calculation rates.

2. The address tags are things like city, town, and house number; they identify the different lines of an address. It is possible to have an address line that does not have any tags, but in this case only generic searches (eg for "New York" anywhere) are possible on the full set of lines. Searches for "City = New York" are not possible.

3. Rules, written in a domain-specific language. This could be regular expressions; a language specific to you; or your normal development language (though this is likely to be the least useful approach, as programming languages can find it difficult to concisely and accurately represent rules of the sort I am talking about). Representative rules, dealing with your description of US Zip Codes, might be:

  - the first five characters must be digits;

  - the first five characters represent an "area code";

  - the zip code must be the last line of an address.

The rules will be divided into groups and sets (eg US addresses). Rules must be able to refer to other rules as well as to the address data; an example rule base is sketched in code below.
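To make this concrete, here is a minimal sketch of such a rule base held as data rather than code. The set names, tags, and the choice of regular expressions as the DSL are illustrative assumptions; the approach deliberately leaves the rule language open:

import re

# Each rule set is an ordered list of (tag, line-pattern) pairs.
# Rules "refer to" position here simply by ordering; a richer DSL
# could let rules reference other rules by name.
RULE_SETS = {
    "US address": [
        ("house+street", re.compile(r"^\d+\s+\w[\w\s]*$")),
        ("city",         re.compile(r"^[A-Za-z][A-Za-z\s]*$")),
        ("zip",          re.compile(r"^\d{5}(-\d{4})?$")),  # must be last line
    ],
    "UK address": [
        ("house+street", re.compile(r"^\d+\s+\w[\w\s]*$")),
        ("town",         re.compile(r"^[A-Za-z][A-Za-z\s]*$")),
        ("postcode",     re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$")),
    ],
}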

  1. Your validation routine needs to take the lines of an address and apply the rules to them (usually set by set). It will return a boolean - valid or invalid address - and, optionally, the set of rules it validated against.

  2. The printing routine again applies the appropriate rule set (probably different from your validation set) to the address data to produce a formatted address; a sketch of both routines follows.
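Continuing the RULE_SETS sketch above (still an assumed shape, not a prescribed implementation), the two routines might look like:

def validate(lines):
    """Try each rule set in turn; return the name of the first set
    whose rules all match, else None (-> manual intervention)."""
    for name, rules in RULE_SETS.items():
        if len(lines) == len(rules) and all(
            pat.match(line) for (_, pat), line in zip(rules, lines)
        ):
            return name
    return None

def tag_lines(lines, set_name):
    """Pair each line with the tag from the validating rule set."""
    return list(zip((t for t, _ in RULE_SETS[set_name]), lines))

lines = ["9 Richmond Road", "Springfield", "55555-4444"]
set_name = validate(lines)
if set_name:
    print(tag_lines(lines, set_name))  # [('house+street', ...), ...]
else:
    print("Needs Validation")          # falls through to the catchall rule

Printing would then walk the tags in the order required by the output country's rule set rather than in the stored order.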

I hope the other components are obvious from the overall approach.

This approach is intended to deal with these issues identified in your question:

  1. Partitioning of phone codes.

  2. Limited number of possible area codes in use.

  3. Blocks of phone numbers reserved for specific purposes.

  4. Normalisation of data - the data is normalised. However, this type of normalisation (reverse indexing) is not generally used except in data warehouse software and databases holding massive volumes of real-time sensor information. It is possible that in implementing this solution you may end up choosing to (controllably) duplicate data. This is not an inherent part of the solution, but may be convenient.

  5. I would strongly suggest NOT adding classes for each variant - this is not scalable, nor maintainable.

  6. Searching is covered in detail below.

  7. Monster tables are avoided - the rule base is likely to be of the order of hundreds to low thousands of rules in addition to the actual data.

  8. The solution is scalable - one simply adds or amends rules.

and also deal with some related problems.

  1. Even if you can apply validation rules to national formats of addresses, there will always be exceptions to the standards for a particular country. My own address is an example - I live on a boat, and need additional information included in my address over and above the Post Office standard address. Anomalies of this sort are always likely to need manual intervention - hence the catchall rule for addresses accepted by manual intervention.

  2. Different countries have different orderings for addresses - addresses in China for instance are written: Country; Post Code; City; City Zone; Street Name; House Number; Person Name.

  3. Acceptance of the first address from an area where you have no rules, and where the rules of the country are different from any you have recorded.

  4. People wanting to use (eg an office) address different from "their" address.

  5. Specific individual addressing concerns - somebody wishing to conceal their correspondence from their nearest and dearest.

  6. The solution can be extended to similar problems - eg referring to a person. This can involve titles (Dr, Rev, etc); multiply hyphenated names (ffoulkes-symthe); qualifications (CPA, BSc, etc); familial qualifiers (the third, etc); and multiple modes of naming depending on the person's culture (people from the Indian subcontinent often do not have a family name, and members of the same family may each appear to have a different surname).

  7. Changes to addressing rules (eg addition of a new Zip Code for a new development) can be made easily and quickly.

Problems still arising are:

  1. Searches will be somewhat more complicated - needing to search all lines and associated tags of addresses rather than specific address fields.

  2. It takes time to build up a rule base of this sort. Whilst initial rule loading can be done fairly quickly (examples are present in your question), anything like a complete set will only exist after you have dealt with multiple exceptions and anomalies.

EDIT:

I wasn't aware of BigTable when I wrote my response. Having had a very quick look at it, it does seem to be a very similar concept. As for implementing this in an ACID environment, I think it is possible for the contact dataset and for the rule base. From experience I know it scales easily to 5*10^7 sets of contact details in such an environment (though with background, non-time-critical validation of the contact details).

In considering the ACID/SQL case, my use of the word "view" may have started a hare running in a direction I did not intend. A more appropriate word may have been overview or outlook (something with no relational-model or DBMS freight attached). In fact I would consider each of the things I referred to as a view to be a candidate table.

I set out a sketch of a schema for my approach to assist discussion. This schema uses some M:N connections, which would obviously be normalised out in an implementation as associative tables. This is left as an exercise for the reader :-). I have put a few sample attributes and data into some of the tables.

[Schema sketch: Address User, Address, Address Line, Rule Set, Rule, Tag, and Tag Usage tables, with M:N connections between them]
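Rendered as DDL, the sketch might look like this (table and column names are illustrative assumptions, and the M:N links are shown already resolved into an associative table):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address_user (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE address      (id INTEGER PRIMARY KEY,
                           user_id INTEGER REFERENCES address_user(id));
CREATE TABLE address_line (id INTEGER PRIMARY KEY,
                           address_id INTEGER REFERENCES address(id),
                           seq  INTEGER,
                           line TEXT);
CREATE TABLE rule_set (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE rule     (id INTEGER PRIMARY KEY,
                       rule_set_id INTEGER REFERENCES rule_set(id),
                       expression TEXT);
CREATE TABLE tag      (id INTEGER PRIMARY KEY, name TEXT);
-- Associative table: which tag applies to which address line,
-- and which rule granted it.
CREATE TABLE tag_usage (line_id INTEGER REFERENCES address_line(id),
                        tag_id  INTEGER REFERENCES tag(id),
                        rule_id INTEGER REFERENCES rule(id));
""")

# The "City = New York" search mentioned earlier becomes a join over tags:
rows = conn.execute("""
    SELECT al.line
    FROM address_line AS al
    JOIN tag_usage AS tu ON tu.line_id = al.id
    JOIN tag AS t        ON t.id = tu.tag_id
    WHERE t.name = 'city' AND al.line = 'New York'
""").fetchall()
print(rows)  # [] until data is loaded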

In the sketch, editing of Rule Set, Rule, and rules relating to other rules (the administrative application) can obviously be done with ACID properties by normal SQL CRUD operations. Operations on Address User, Address, and Address Line can equally be done in an ACID/SQL manner, taking the address lines as input by the user. You may wish to do some preprocessing of the transaction to separate out (in my example) House Number from Road Name, and consequently lose the "Road Name" rule. Other pre-processing that is likely to be useful is the standardisation of capitalisation - though this could also be part of the validation step. It is also possible to simply accept the complete line "9 Richmond Road" as input and tag it as "Needs Validation". In either case there is no problem (that I am aware of) about making this transaction ACID.

Where I am less clear is how the validation and subsequent tagging can be incorporated into an ACID/SQL transaction. It does seem that to make the validation step ACID, it may be necessary to order the application of the testing against the rule sets (with the most common cases tested first, stopping when a validation succeeds). It may also be necessary, in order to make the transaction ACID, to validate only against your most common case, leaving the others tagged as "Needs Validation" to be validated later as a background task.

The actual task of validation involves checking the entire rule base, set by set, until a set of rules is found which validates all the input lines. There are obvious shortcuts available from the data itself - if you have recorded the country, then this both enables you to tag the line with the country straight away and provides a filter for the rule sets you have to test against. I am sorry, but this is as far as I can take this aspect so far.

Incidentally, this sketch schema only goes part way towards total normalisation. As drawn, each address has its own sequence of address lines, and there is no normalisation of the data beyond the individual lines as input (plus or minus any pre-processing you choose to do). It is possible to take this approach even further - by making the link between Address and Address Line M:N and making the Line field of the Address Line table its primary key, so that identical lines are stored only once; a sketch follows.
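That last step, sketched on top of the earlier DDL (again with assumed names):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Lines become shared values; the associative table carries the
-- per-address sequence, making Address <-> Address Line M:N.
CREATE TABLE address           (id INTEGER PRIMARY KEY);
CREATE TABLE address_line      (line TEXT PRIMARY KEY);
CREATE TABLE address_uses_line (
    address_id INTEGER REFERENCES address(id),
    seq        INTEGER,
    line       TEXT REFERENCES address_line(line),
    PRIMARY KEY (address_id, seq));
""")

With this shape, "9 Richmond Road" is stored once in address_line no matter how many contacts share it.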

When it comes to more detailed resources for this concept I have two problems.

The first (and trivial) one is that this concept is my original work, based on my experience over more than twenty years as a method consultant and IT Strategy consultant with a special interest in development environment technology and development methods. All my working life has been spent in environments where contact details have been a major concern (for financial and regulatory/legislative reasons). In fact my original response to your question was complete in my mind before I had finished reading your question, even though it did then take me about three-quarters of an hour to get it typed up.

The more important reason is that some of the sources of this idea are confidential or secret. In my working career, part of my job involved keeping up to date with technology developments, and predicting the impact of technology on the business in ten years' time. This involved visiting research labs and having discussions with leading researchers about a variety of topics. While I am not, myself, a first-class researcher, I do seem to be very good at synthesising the research efforts of others. However, while doing this I was always operating under conditions of commercial confidentiality and/or military secrecy. None of my answer has breached these conditions. As a result of this I can only give vague guidelines as to how the information was derived.

My sources are:

  1. a seminar conducted by C J Date, at IBM, exploring the further reaches of normalisation and the Relational Model (NOT the Relational Model as implemented in SQL). This involved exploration of fifth (?) and sixth (?) normal forms.

  2. a series of discussions with Oracle technical staff over a period of time, discussing meta data; meta meta data; and other such generalisations.

  3. discussions with a UK based military research establishment. Though this was some years ago, I am not sure whether anything has ever been published on the topics we were discussing.

  4. working in a large financial institution whose contact details system was shaped much like my proposal, but arose from non-relational roots, and for which the original technical impetus was space saving, in an era when memory, persistent storage, and backup capacity were major concerns.

EDIT:

I completed the above edit, shut down my computer, did some household chores, went to bed, lay down, closed my eyes, and I had the solution to the part I could not complete in that previous edit.

While there are updates to be done when tagging/validating, much of the work is actually reading and comparison. As such, it is a prime candidate for optimistic locking. In pseudo code this would look like:

Start read transaction
  Read set of address lines where one or more lines are "Needs Validation"
  Read all rule set names
  Read all rule lines which belong to the first rule set
  If read is not successful then abandon and start process again
End read transaction
Do while address not validated and not end of rule sets
  If set of address lines validates against current rule set then
    Prepare transaction by allocating the tags to be applied to each line
    of the address and temporarily recording the rule set that has
    validated the address
    Start validation transaction
      Read same set of address lines and rule set (all fields)
      If the address and the rule set are identical to what was obtained
      in the read transaction then
        Create appropriate set of Tag Usage records
        If successful then
          Commit validation transaction
          End overall process
      If not successful or not identical then
        Rollback validation transaction (Tag and Tag Usage)
          ** This is the point where there might be problems if the rule
          set has changed since reading it. The ordering may have changed
          due to amendment, deletion or creation of rule sets. This is
          the reason for the initial read of all the rule set names.
          However, if there is such a change one can afford to fail to
          validate and move on, aiming to come back to this address later.
    End validation transaction
  Start read transaction
    Read rule lines which belong to the next rule set
  End read transaction
End while loop
If end of rule sets reached and address not validated
  Begin create Tag Usage transaction
    Create Tag Usage record pointing to Tag "Manual Intervention needed"
  End create Tag Usage transaction

All three types of transaction (Address and Address Line updates; Rule Set and Rule updates; and Tag and Tag Usage updates) fulfil the ACID conditions. I strongly believe (but have not proved) that any interleaving, combination, or intersecting set of these three types of transaction will also fulfil the ACID conditions.
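For what it's worth, here is the same optimistic pattern compressed into code. It assumes the DDL sketched earlier plus two invented columns on address (version, bumped by every writer, and validated_by), and an in-memory rule base shaped like the earlier RULE_SETS sketch; it illustrates the pattern rather than being a finished implementation:

import sqlite3

def validate_one(conn, address_id, rule_sets):
    """Optimistically tag one address: plain reads first, then a
    short write transaction that re-checks nothing moved under us."""
    lines = [r[0] for r in conn.execute(
        "SELECT line FROM address_line WHERE address_id=? ORDER BY seq",
        (address_id,))]
    (version,) = conn.execute(
        "SELECT version FROM address WHERE id=?", (address_id,)).fetchone()

    for name, rules in rule_sets.items():
        if len(lines) != len(rules) or not all(
                pat.match(l) for (_, pat), l in zip(rules, lines)):
            continue  # this rule set does not validate the address
        with conn:  # BEGIN ... COMMIT (or ROLLBACK on exception)
            (now,) = conn.execute(
                "SELECT version FROM address WHERE id=?",
                (address_id,)).fetchone()
            if now != version:
                return False  # changed under us; come back to it later
            conn.execute(
                "UPDATE address SET validated_by=?, version=version+1 "
                "WHERE id=?", (name, address_id))
        return True

    with conn:  # no set matched: flag for the manual-intervention path
        conn.execute(
            "UPDATE address SET validated_by=?, version=version+1 "
            "WHERE id=?", ("Manual Intervention needed", address_id))
    return False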

answered Sep 28 '22 by Chris Walton