How to normalize data efficiently while INSERTing into SQL table (Postgres)

I want to import a large log file into (Postgres-)SQL

Certain string columns are very repetitive; for example, the column 'event_type' has one of 10 different string values.

I have a rough understanding of normalizing data.

Firstly, is it correct to assume that it's beneficial (for storage size, indexing, and query speed) to store event_type in a separate table (possibly with a foreign key relation)?

In order to normalize I would have to check for the distinct values of event_type in the raw log and insert them into the event_types table.

There are many fields like event_type in the log.

So, secondly: is there a way to tell the database to create and maintain this kind of table when inserting the data?

Are there other strategies to accomplish this? I'm working with pandas.

Cilvic asked May 17 '14 06:05


1 Answer

This is a typical situation when starting to build a database from data hitherto stored otherwise, such as in a log file. There is a solution - as usual - but it is not a very fast one. Perhaps you can write a log message handler to process messages as they come in; provided the flux (messages/second) is not too large you won't notice the overhead, especially if you can forget about writing the message to a flat text file.

Firstly, on the issue of normalization: yes, you should always normalize, and to the so-called 3rd Normal Form (3NF). This basically means that any piece of real-world data (such as your event_type) is stored once and once only. (There are cases where you could relax this a little and go only to 2NF - usually when the real-world data requires very little storage, such as an ISO country code or an M/F (male/female) choice - but in most other cases 3NF will be better.)

In your specific case, let's say that your event_type is a char(20) column. Ten such event types with their corresponding int codes easily fit on a single database page (8kB by default in Postgres). If you have 1,000 log messages that each carry event_type as a char(20), you need roughly 20kB just to store that one column; stored as an int foreign key, the same information takes about 4kB. If your log messages contain other repetitive items like this, the storage reduction grows accordingly. Items such as dates or timestamps can likewise be stored in their native formats (4 and 8 bytes, respectively) for smaller storage, better performance and added functionality (such as comparing dates or selecting ranges).
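
If you want to check these numbers on your own data, Postgres has built-in size functions you can use for a quick comparison; a minimal sketch (the sample value and the table name are just illustrations):

-- Size in bytes of a padded char(20) value versus the integer code that replaces it
SELECT pg_column_size('user_login'::char(20)) AS char20_bytes,
       pg_column_size(42::integer)            AS int_bytes;   -- the integer is 4 bytes

-- After loading, the total on-disk footprint of a table (data, indexes, TOAST):
SELECT pg_size_pretty(pg_total_relation_size('log_message'));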

Secondly, you cannot tell the database to create such tables; you have to do that yourself. But once they exist, a stored procedure can parse your log messages and put the data into the right tables.

In the case of log messages, you can do something like this (assuming you want to do the parsing in the database rather than in Python):

CREATE FUNCTION ingest_log_message(mess text) RETURNS int AS $$
DECLARE
  parts  text[];
  et_id  int;
  log_id int;
BEGIN
  parts := regexp_split_to_array(mess, ','); -- Whatever your delimiter is

  -- Assuming:
  --   parts[1] is a timestamp
  --   parts[2] is your event_type
  --   parts[3] is the actual message

  -- Get the event_type identifier. If event_type is new, INSERT it, else just get the id.
  -- Do likewise with other log message parts whose unique text is located in a separate table.
  SELECT id INTO et_id
  FROM event_type
  WHERE type_text = parts[2];
  IF NOT FOUND THEN
    INSERT INTO event_type (type_text)
    VALUES (parts[2])
    RETURNING id INTO et_id;
  END IF;

  -- Now insert the log message
  INSERT INTO log_message (dt, et, msg)
  VALUES (parts[1]::timestamp, et_id, parts[3])
  RETURNING id INTO log_id;

  RETURN log_id;
END; $$ LANGUAGE plpgsql STRICT;

The tables you need for this are:

CREATE TABLE event_type (
  id        serial PRIMARY KEY,
  type_text char(20) UNIQUE   -- each distinct event type is stored only once
);

and

CREATE TABLE log_message (
  id        serial PRIMARY KEY,
  dt        timestamp,
  et        integer REFERENCES event_type,
  msg       text
);

You can then invoke this function as a simple SELECT statement, which will return the id of the newly inserted log message:

SELECT * FROM ingest_log_message(the_message);
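
For bulk-importing a whole log file, one approach (just a sketch; the staging table name and file path are assumptions) is to copy the raw lines into a staging table first and then let the function process them in a single pass:

-- Staging table holding one raw log line per row
CREATE TABLE raw_log (line text);

-- From psql, \copy reads the file on the client side:
--   \copy raw_log FROM 'mylog.txt'

-- Ingest everything in one pass
SELECT ingest_log_message(line) FROM raw_log;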

Note that the elements of the parts array are used directly in the SQL statements inside the function. PL/pgSQL passes such variables to the query as bound values rather than splicing them into the SQL text, which has two important consequences: (1) quotes inside the strings (words like "isn't") cannot break the statement; and (2) maliciously crafted log messages cannot inject SQL. There is therefore no need to escape the values manually with quote_literal().
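
As an aside, if you are on PostgreSQL 9.5 or later, the look-up-or-insert block inside the function can be written more compactly with INSERT ... ON CONFLICT. A sketch of that variant, relying on the UNIQUE constraint on type_text shown above:

-- Insert the event type if it is new, do nothing if it already exists,
-- then fetch the id either way.
INSERT INTO event_type (type_text)
VALUES (parts[2])
ON CONFLICT (type_text) DO NOTHING;

SELECT id INTO et_id
FROM event_type
WHERE type_text = parts[2];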

All of the above obviously needs to be tailored to your specific situation.

Patrick answered Oct 04 '22 20:10