I have a MySQL database filled with one huge table of 80 columns and 10 million rows. The data may have inconsistencies.
I would like to normalize the database in an automated and efficient way.
I could do it using Java/C++/..., but I would like to do as much as possible inside the database. I suspect that any work done outside the database would slow things down considerably.
Suggestions on how to do it? What are good resources/tutorials to start with?
I am not looking for hints on what normalization is (I found plenty of that using Google)!
In the database normalization process there is a series of rules called normal forms. The main ones are First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF).
Normalization is the process of eliminating data redundancy and improving data integrity. It also helps organize the data in the database. It is a multi-step process that puts the data into tabular form and removes duplicated data from the relational tables.
You need to study the columns to identify 'like' entities and break them out into separate tables. At best, an automated tool might identify groups of rows with identical values for some of the columns, but a person who understands the data would have to decide whether those truly belong to a separate entity.
Here's a contrived example - suppose your columns were first name, last name, address, city, state, zip. An automated tool might identify rows of people who were members of the same family with the same last name, address, city, state, and zip and incorrectly conclude that those five columns represented an entity. It might then split the table up:
First Name, ReferenceID
and another table
ID, Last Name, Address, City, State, Zip
See what I mean?
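In schema terms, that incorrect automated split might look something like this (just a sketch; the table and column names are hypothetical):
-- What the naive tool would produce: a "household" table that is not a
-- real entity, only coincidentally repeated address data
CREATE TABLE household (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    last_name VARCHAR(64),
    address   VARCHAR(128),
    city      VARCHAR(64),
    state     CHAR(2),
    zip       VARCHAR(10)
);

CREATE TABLE person (
    first_name   VARCHAR(64),
    reference_id INT,
    FOREIGN KEY (reference_id) REFERENCES household(id)
);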
I can't think of any way you can automate it. You would have to create the tables that you want, and then go through and move each piece of data over with manual queries.
e.g.,
INSERT INTO contact (first_name, last_name, phone)
SELECT DISTINCT first_name, last_name, phone
FROM massive_table;
then you could drop those columns from the massive table and replace them with a contact_id column, as sketched below.
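The follow-up steps might look roughly like this (a sketch only; it assumes contact has an auto-increment id column and that the table/column names match the example above):
-- link each wide row to its contact, then drop the now-redundant columns
ALTER TABLE massive_table ADD COLUMN contact_id INT;

UPDATE massive_table m
JOIN contact c
  ON c.first_name = m.first_name
 AND c.last_name  = m.last_name
 AND c.phone      = m.phone
SET m.contact_id = c.id;

ALTER TABLE massive_table
  DROP COLUMN first_name,
  DROP COLUMN last_name,
  DROP COLUMN phone;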
You would have a similar process when pulling out rows that go into a one-to-many table.
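For example, if the wide table had repeating groups such as email1, email2, email3 (hypothetical column names), you could fold them into a child table:
-- one-to-many: each contact can have several email addresses
CREATE TABLE contact_email (
    contact_id INT NOT NULL,
    email      VARCHAR(255) NOT NULL,
    FOREIGN KEY (contact_id) REFERENCES contact(id)
);

INSERT INTO contact_email (contact_id, email)
SELECT contact_id, email1 FROM massive_table WHERE email1 IS NOT NULL
UNION ALL
SELECT contact_id, email2 FROM massive_table WHERE email2 IS NOT NULL
UNION ALL
SELECT contact_id, email3 FROM massive_table WHERE email3 IS NOT NULL;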
When cleaning up messy data, I like to create user-defined MySQL functions to do typical data-scrubbing work... that way you can reuse them later. Approaching it this way also lets you check whether existing UDFs have already been written that you can use (with or without modification)... for example, mysqludf.org
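A minimal sketch of what such a reusable scrubbing helper could look like, written here as a plain stored function rather than a compiled UDF (the function name and logic are just an example):
DELIMITER //
CREATE FUNCTION clean_phone(raw VARCHAR(64))
RETURNS VARCHAR(64) DETERMINISTIC
BEGIN
    -- strip spaces, dashes and parentheses so inconsistently formatted
    -- phone numbers compare equal during deduplication
    RETURN REPLACE(REPLACE(REPLACE(REPLACE(raw, ' ', ''), '-', ''), '(', ''), ')', '');
END //
DELIMITER ;

-- then reuse it wherever you scrub or compare, e.g.:
-- SELECT DISTINCT clean_phone(phone) FROM massive_table;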