 

Automated normalization of a MySQL database - how to do it?

I have a MySQL database filled with one huge table of 80 columns and 10 million rows. The data may have inconsistencies.

I would like to normalize the database in an automated and efficient way.

I could do it using Java/C++/..., but I would like to do as much as possible inside the database, since I expect any work outside it to slow things down considerably.

Suggestions on how to do it? What are good resources/tutorials to start with?

I am not looking for any hints on what normalization is (I found plenty of that via Google)!

asked Jul 22 '09 by CL23




3 Answers

You need to study the columns to identify 'like' entities and break them out into separate tables. At best, an automated tool might identify groups of rows with identical values for some of the columns, but a person who understands the data would have to decide whether those truly belong in a separate entity.

Here's a contrived example - suppose your columns were first name, last name, address, city, state, zip. An automated tool might identify rows of people who were members of the same family with the same last name, address, city, state, and zip and incorrectly conclude that those five columns represented an entity. It might then split the table up:

First Name, ReferenceID

and another table

ID, Last Name, Address, City, State, Zip

See what I mean?
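In MySQL DDL, the (possibly wrong) split an automated tool might propose could be sketched like this - table and column names here are purely illustrative, not part of the original question:

```sql
-- Hypothetical split of the contrived example. A tool might call the
-- shared columns a "household" entity, but only someone who knows the
-- data can confirm that this grouping is actually meaningful.
CREATE TABLE household (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    last_name VARCHAR(100),
    address   VARCHAR(255),
    city      VARCHAR(100),
    state     CHAR(2),
    zip       VARCHAR(10)
);

CREATE TABLE person (
    first_name   VARCHAR(100),
    household_id INT,
    FOREIGN KEY (household_id) REFERENCES household(id)
);
```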

answered Oct 23 '22 by n8wrl


I can't think of any way you can automate it. You would have to create the tables that you want, and then go through and replace each piece of data with manual queries.

e.g.,

INSERT INTO contact
SELECT DISTINCT first_name, last_name, phone
FROM massive_table;

Then you could drop those columns from the massive table and replace them with a contact_id column.

You would have a similar process when pulling out rows that go into a one-to-many table.
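Assuming the contact table from the example above has an auto-increment id column, the follow-up steps could be sketched like this (the join key and table names are illustrative assumptions, not from the original answer):

```sql
-- 1. Add the new foreign-key column to the big table.
ALTER TABLE massive_table ADD COLUMN contact_id INT;

-- 2. Point each row at its contact row, matching on the columns
--    that were deduplicated into the contact table.
UPDATE massive_table m
JOIN contact c
  ON c.first_name = m.first_name
 AND c.last_name  = m.last_name
 AND c.phone      = m.phone
SET m.contact_id = c.id;

-- 3. Drop the now-redundant columns.
ALTER TABLE massive_table
  DROP COLUMN first_name,
  DROP COLUMN last_name,
  DROP COLUMN phone;
```

On a 10-million-row table, step 2 will want an index on the join columns in both tables before you run it.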

answered Oct 23 '22 by Brian Ramsay


When cleaning up messy data, I like to create user-defined MySQL functions for typical data-scrubbing tasks; that way you can reuse them later. This approach also lets you check whether existing UDFs have already been written that you can use (with or without modification), for example at mysqludf.org.
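As one sketch of the idea, a reusable scrubbing routine can be written as a MySQL stored function - the function name and logic below are illustrative, not from the answer:

```sql
-- Illustrative stored function: trim a string and collapse runs of
-- spaces into single spaces. DELIMITER is a mysql-client directive.
DELIMITER //
CREATE FUNCTION clean_ws(s TEXT) RETURNS TEXT DETERMINISTIC
BEGIN
    SET s = TRIM(s);
    WHILE INSTR(s, '  ') > 0 DO
        SET s = REPLACE(s, '  ', ' ');
    END WHILE;
    RETURN s;
END//
DELIMITER ;

-- Reuse it wherever data needs scrubbing:
UPDATE massive_table SET city = clean_ws(city);
```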

answered Oct 23 '22 by codemonkey